Research note: Climate science domain benchmark — first cross-domain eval run #12
Summary
First benchmark run against an external domain (climate science), contributed by @harshareddy832 in #9. This note documents what we found, what broke, and what it revealed about the architecture.
Setup
- Domain file: eval/domains/climate_science.json
- Models: cerebras/llama3.1-8b (small baseline), groq/llama-3.3-70b-versatile (large baseline + NoUse run)
- Judge: groq/llama-3.3-70b-versatile
- Seeding: brain.add() — 120 relations inserted from Q&A pairs and key concepts (sketch below)
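As a rough sketch of the seeding step (brain.add() and the 120-relation count are from this note; the import path, the triple-style signature, and the JSON schema are all assumptions):

```python
import json

# `Brain` stands in for the NoUse graph object; the import path and
# constructor are hypothetical; only brain.add() appears in this note.
from nouse import Brain

brain = Brain()

# Assumed layout for the domain file: a list of Q&A entries, each with
# optional key concepts. The real climate_science.json may differ.
with open("eval/domains/climate_science.json") as f:
    entries = json.load(f)

count = 0
for e in entries:
    # One relation per Q&A pair...
    brain.add(e["question"], "answered_by", e["answer"])
    count += 1
    # ...plus one per key concept (assumed split behind the 120 total).
    for concept in e.get("concepts", []):
        brain.add(e["question"], "mentions", concept)
        count += 1

print(f"seeded {count} relations")
```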
Results

[Results table: cerebras/llama3.1-8b (small baseline), groq/llama-3.3-70b-versatile (large baseline), groq/llama-3.3-70b-versatile (NoUse); per-run scores not recovered]

Aggregate scores are low and not the main finding — see below.
What actually happened
Individual question result (climate_009)
Run B explicitly cited the graph: the model knew what it knew and why — with an evidence score attached. That is the intended behavior.
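For concreteness, a graph-citing answer of the kind Run B produced might be recorded like this (the field names, the tuple shape, and the 0.87 value are illustrative assumptions; only "cited the graph, with an evidence score attached" is from the run):

```python
# Hypothetical record shape; every field name here is an assumption.
run_b_result = {
    "question_id": "climate_009",
    "answer": "...",  # the model's answer text
    "graph_citations": [
        {
            # cited (subject, predicate, object) from the seeded graph
            "relation": ("...", "...", "..."),
            "evidence_score": 0.87,  # illustrative value, not from the run
        }
    ],
}
```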
What confounded the aggregate scores
Failed model calls came back as [ERROR] and were scored 0, dragging the aggregates down.
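To make the confound concrete, here is a minimal illustration of errors-scored-as-zero versus errors excluded (the numbers are made up; only the score-0 treatment is from the run):

```python
# Hypothetical per-question scores where failed calls returned "[ERROR]".
raw = [0.8, "[ERROR]", 0.7, "[ERROR]", 0.6]

# How the run aggregated: errors become 0 but stay in the denominator.
scored = [0.0 if r == "[ERROR]" else r for r in raw]
with_errors = sum(scored) / len(scored)  # 0.42

# The same answers with failed calls excluded instead of zeroed.
ok = [r for r in raw if r != "[ERROR]"]
without_errors = sum(ok) / len(ok)       # 0.70
```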
What this revealed: the cold-start problem

NoUse cannot help in a domain it has never seen. This is expected — but it means cross-domain benchmarking requires domain seeding first.
This prompted two new architectural issues:
- scoping graph knowledge by model, sessions, or domain
- a cross-model epistemic cache with zero-token replay and Hebbian correlation

The hypothesis from #10: if NoUse seeds domain knowledge reactively when a query misses, aggregate scores should converge toward the internal benchmark results.
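A minimal sketch of that reactive-seeding flow (control flow only; brain.query, seed_domain, and generate_answer are assumed placeholder names, not NoUse's actual API):

```python
def answer_with_reactive_seeding(question, domain, brain):
    """On a graph miss, seed the domain, then retry the lookup.

    Hypothetical control flow for the #10 hypothesis; every call below
    is an assumed placeholder, not a documented NoUse function.
    """
    hits = brain.query(question)        # assumed lookup API
    if not hits:                        # cold start: domain never seen
        seed_domain(brain, domain)      # e.g. the brain.add() loop from Setup
        hits = brain.query(question)    # retry once the graph is seeded
    return generate_answer(question, evidence=hits)  # assumed generation step
```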
Next steps
Null results and methodology failures are as useful as wins. If you replicate this or run it on a different domain, post results here.