Objective
Measure whether Metis-backed content is provably better than LLM-only content for healthcare writing, using parentguidebook as the test bed. The goal is not a vibes comparison ("which reads better?") but a metrics-based evaluation that produces numbers we can publish.
The three metrics
1. Hallucination rate
How many claims in the article have no backing in any source?
Method: For each article, extract every factual claim (manually or with an LLM evaluator). For each claim, check:
- LLM-only: can the claim be traced to a real, verifiable source? (Google it, check PubMed, etc.)
- LLM+Metis: does the claim cite a Metis atom? Does the atom have a verified quotedSpan? Does the quotedSpan exist in the source file?
Metric: hallucination_rate = unverifiable_claims / total_claims
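As a minimal sketch of the metric above, assuming each extracted claim is stored with a boolean flag set by the scorer. The `Claim` type and its field names are hypothetical and not part of Metis.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """One factual claim extracted from an article (hypothetical record shape)."""
    text: str
    verifiable: bool  # True if the scorer traced the claim to a source


def hallucination_rate(claims: list[Claim]) -> float:
    """Return unverifiable_claims / total_claims for one article."""
    if not claims:
        raise ValueError("cannot score an article with no extracted claims")
    unverifiable = sum(1 for claim in claims if not claim.verifiable)
    return unverifiable / len(claims)


# Placeholder claims, not real evaluation data.
example = [
    Claim("claim A", True),
    Claim("claim B", True),
    Claim("claim C", False),
    Claim("claim D", True),
]
print(hallucination_rate(example))  # 1 of 4 claims is unverifiable -> 0.25
```

Raising on an empty claim list is a design choice here: an article with no extracted claims probably signals an extraction failure rather than a 0% hallucination rate.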
2. Consistency rate
Do articles contradict each other?
Method: After producing all 20 articles, extract claims that overlap in topic. Within each condition (the 10 LLM-only articles, then the 10 LLM+Metis articles), check whether any two articles make contradictory statements about the same subject.
- LLM-only: each article is generated independently, so contradictions are likely
- LLM+Metis: all articles draw from the same knowledge graph, so contradictions should be flagged by Metis before they reach the article
Metric: contradiction_count across the full set of articles in each condition
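The count could be tallied along these lines once an evaluator has reduced each article to a map of subject to normalized statement. That reduction is the hard, judgment-heavy part and is assumed to be done already; the data shape and function name are hypothetical.

```python
from itertools import combinations


def contradiction_count(articles: dict[str, dict[str, str]]) -> int:
    """Count (article pair, subject) combinations whose statements disagree.

    `articles` maps an article id to {subject: normalized statement}.
    Only subjects covered by both articles in a pair are compared.
    """
    count = 0
    for (_, first), (_, second) in combinations(articles.items(), 2):
        for subject in first.keys() & second.keys():
            if first[subject] != second[subject]:
                count += 1
    return count


# Placeholder subjects and statements, not real article content.
one_condition = {
    "article-1": {"subject-1": "x", "subject-2": "y"},
    "article-2": {"subject-1": "x", "subject-2": "z"},
    "article-3": {"subject-2": "y"},
}
print(contradiction_count(one_condition))  # article-2 disagrees with 1 and 3 on subject-2 -> 2
```

Run it once per condition so the LLM-only and LLM+Metis sets each get their own count.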
3. Production time
How long does it take to produce a publishable, cited article?
Method: Time the full workflow from "topic decided" to "article with citations ready for review."
- LLM-only: prompt engineering + generation + manual fact-checking + manual citation
- LLM+Metis: metis apply + chatbot composition (citations come free)
Metric: hours_per_article (include fact-checking time for LLM-only)
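One way to keep the timing comparable across conditions is to log each workflow stage separately and sum the stages. The sketch below assumes stage names of our own choosing, and the durations are illustrative only, not measurements.

```python
from datetime import timedelta


def hours_per_article(stage_durations: dict[str, timedelta]) -> float:
    """Sum all recorded stages for one article and return the total in hours."""
    total = sum(stage_durations.values(), timedelta())
    return total.total_seconds() / 3600


# Illustrative durations for one LLM-only article (hypothetical stage names).
llm_only_stages = {
    "prompt_engineering": timedelta(minutes=30),
    "generation": timedelta(minutes=15),
    "manual_fact_checking": timedelta(minutes=90),
    "manual_citation": timedelta(minutes=45),
}
print(hours_per_article(llm_only_stages))  # 180 minutes in total -> 3.0
```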
Test design
Source material
Build a Metis project with 5-10 healthcare sources on child safety topics:
- AAP (American Academy of Pediatrics) guidelines on infant sleep
- WHO feeding recommendations
- CDC immunization schedules
- 2-3 peer-reviewed papers on common parenting safety topics
- 1-2 parenting guidebooks (EPUB or PDF)
Total: ~5-10 sources, enough to create a meaningful knowledge graph.
Profile: strict — this is healthcare content. Full provenance required.
Article topics
10 topics, each written twice (once LLM-only, once LLM+Metis):
- Safe sleep practices for newborns
- When and how to introduce solid foods
- Common infant choking hazards and prevention
- Fever management in children under 2
- Car seat safety by age and weight
- Childhood vaccination schedule and common concerns
- Baby-proofing your home room by room
- Safe bathing practices for infants
- When to call the doctor: warning signs in infants
- Sun safety for babies and toddlers
Control conditions
- LLM-only: The chatbot receives only the topic. No Metis context. It uses its training data.
- LLM+Metis: The chatbot receives the topic + the full Metis apply output for a relevant query. Same chatbot, same model, same system prompt (except for the Metis context block).
- Same model for both: Use the same LLM (e.g., Claude Sonnet or Kimi) so the only variable is whether Metis context is provided.
- Same evaluator for both: The person scoring hallucinations and contradictions should not know which article is which (blind evaluation if possible).
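Blinding can be approximated by shuffling the articles, giving each an opaque id, and keeping the id-to-condition key away from the scorer until scoring is finished. This is a sketch under assumed field names (`topic`, `condition`, `text`), not an existing Metis feature.

```python
import random


def blind_articles(articles: list[dict], seed: int) -> tuple[list[dict], dict[str, str]]:
    """Return (blinded articles for the scorer, key mapping blind id -> condition)."""
    rng = random.Random(seed)  # seeded so the assignment can be reproduced later
    shuffled = list(articles)
    rng.shuffle(shuffled)

    blinded, key = [], {}
    for index, article in enumerate(shuffled, start=1):
        blind_id = f"article-{index:02d}"
        # The condition is deliberately left out of what the scorer sees.
        blinded.append({"id": blind_id, "topic": article["topic"], "text": article["text"]})
        key[blind_id] = article["condition"]
    return blinded, key


# Placeholder articles, not real generated content.
pair = [
    {"topic": "topic-1", "condition": "llm_only", "text": "draft A"},
    {"topic": "topic-1", "condition": "llm_metis", "text": "draft B"},
]
blinded, key = blind_articles(pair, seed=1)
print([article["id"] for article in blinded])
```

The scorer receives only `blinded`; `key` is joined back to the scores after evaluation.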
Evaluation process
Phase 1: Build the knowledge base (~1-2 weeks)
- Collect 5-10 healthcare sources (EPUBs, PDFs once supported, markdown)
- Run metis learn with strict profile
- Verify atom count, quotedSpans coverage, entity graph quality
- Spot-check 10-20 atoms manually: are the quotes real? Are the claims accurate?
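The spot-check is less prone to cherry-picking if the atoms are drawn at random. A minimal sketch, assuming the atom ids have already been exported to a list (how they are exported from Metis is not covered here):

```python
import random


def spot_check_sample(atom_ids: list[str], k: int = 15, seed: int = 0) -> list[str]:
    """Pick up to k distinct atom ids at random for manual review."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    return rng.sample(atom_ids, min(k, len(atom_ids)))


# Hypothetical ids standing in for a real atom export.
all_atoms = [f"atom-{n}" for n in range(1, 101)]
sample = spot_check_sample(all_atoms, k=15, seed=42)
print(len(sample))  # 15
```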
Phase 2: Generate articles (~1 week)
- For each topic, run metis apply with a relevant query
- Feed the Metis output to the chatbot as context, generate the LLM+Metis article
- Generate the LLM-only article with the same chatbot, no Metis context
- Save all 20 articles with metadata (which is which, generation time, model used)
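The metadata could be kept as one JSON record per article so that Phase 3 can load all 20 uniformly. The record shape below is an assumption built from the fields named above (condition, generation time, model); the field names and example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ArticleRecord:
    """Hypothetical per-article metadata record."""
    topic: str
    condition: str  # "llm_only" or "llm_metis"
    model: str
    generation_seconds: float
    text: str


def to_json_line(record: ArticleRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


record = ArticleRecord(
    topic="Safe sleep practices for newborns",
    condition="llm_only",
    model="example-model",  # placeholder model name
    generation_seconds=42.0,  # illustrative value, not a measurement
    text="(article body)",
)
print(to_json_line(record))
```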
Phase 3: Evaluate (~1-2 weeks)
- Extract claims from all 20 articles (aim for 15-30 claims per article)
- Score each claim: verifiable (with source) or unverifiable (hallucination)
- For LLM+Metis claims: verify the Metis provenance chain (atom → quotedSpan → source)
- Cross-check all 20 articles for contradictions
- Record time spent on each article including any fact-checking
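The provenance check (atom → quotedSpan → source) can be partly automated. The sketch below assumes a simplified atom record and an in-memory map of source id to source text; the real Metis field names and storage layout may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Atom:
    """Hypothetical record for a Metis atom; real field names may differ."""
    atom_id: str
    source_id: str
    quoted_span: Optional[str]


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def provenance_holds(atom: Atom, sources: dict[str, str]) -> bool:
    """True only if the atom has a non-empty quoted span found in its source text."""
    span = _normalize(atom.quoted_span or "")
    source_text = sources.get(atom.source_id)
    if not span or source_text is None:
        return False
    return span in _normalize(source_text)


# Placeholder source text and atoms, not real healthcare content.
sources = {"src-1": "Placeholder source text.\nIt spans   two lines."}
good = Atom("atom-1", "src-1", "source text. It spans two lines")
missing_quote = Atom("atom-2", "src-1", None)
not_in_source = Atom("atom-3", "src-1", "text that is not in the file")

print(provenance_holds(good, sources))           # True
print(provenance_holds(missing_quote, sources))  # False
print(provenance_holds(not_in_source, sources))  # False
```

A passing check shows only that the quote exists in the source. Whether the article's claim actually follows from the quote still needs a human reviewer.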
Phase 4: Report
- Compile metrics: hallucination rate, contradiction count, hours per article
- Include specific examples: "LLM-only said X about fever management, but the AAP actually says Y"
- Include the full provenance chain for 2-3 LLM+Metis claims as demonstration
- Publish findings (blog post, GitHub discussion, or paper)
Expected outcomes (hypotheses)
| Metric | LLM-only (expected) | LLM+Metis (expected) |
| --- | --- | --- |
| Hallucination rate | 10-20% of claims | <5% (only in chatbot composition layer) |
| Contradictions across 10 articles | 3-8 | 0-1 |
| Hours per article (with fact-checking) | 2-4 hours | 0.5-1 hour |
If these hypotheses hold, the delta is publishable and sellable:
- Up to 4x fewer hallucinations (20% of claims versus 5%)
- Near-zero contradictions
- 3-4x faster production
Prerequisites
Definition of done
- 20 articles generated (10 pairs)
- All claims scored for verifiability
- Contradiction analysis complete
- Production time recorded
- Results documented with specific examples
- At least one blog post or report published with findings
Why this matters
This test produces the proof that turns "Metis adds rigor" from a claim into a measurement. If the numbers hold, this becomes the sales deck for every market we pursue: "In a blind test on healthcare content, LLM+Metis produced 4x fewer hallucinations and zero contradictions across 10 articles." That's not a vibes argument. That's data.