Test plan: LLM-only vs LLM+Metis healthcare article quality comparison #17

@0b00101111

Objective

Measure whether Metis-backed content is measurably better than LLM-only content for healthcare writing, using parentguidebook as the test bed. The goal is not a vibes comparison ("which reads better?") but a metrics-based evaluation that produces numbers we can publish.

The three metrics

1. Hallucination rate

How many claims in the article have no backing in any source?

Method: For each article, extract every factual claim (manually or with an LLM evaluator). For each claim, check:

  • LLM-only: can the claim be traced to a real, verifiable source? (Google it, check PubMed, etc.)
  • LLM+Metis: does the claim cite a Metis atom? Does the atom have a verified quotedSpan? Does the quotedSpan exist in the source file?

Metric: hallucination_rate = unverifiable_claims / total_claims
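
The metric above can be computed per article once each claim has been scored. The following is a minimal sketch; the `Claim` record shape and the placeholder claims are assumptions for illustration, not part of Metis.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """One factual claim extracted from an article (hypothetical record shape)."""
    text: str
    verifiable: bool  # True if the evaluator traced the claim to a source


def hallucination_rate(claims: list[Claim]) -> float:
    """Return unverifiable_claims / total_claims for one article."""
    if not claims:
        raise ValueError("cannot compute a rate for an article with no claims")
    unverifiable = sum(1 for claim in claims if not claim.verifiable)
    return unverifiable / len(claims)


# Placeholder claims, not real evaluation data.
claims = [
    Claim("Placeholder claim A", verifiable=True),
    Claim("Placeholder claim B", verifiable=True),
    Claim("Placeholder claim C", verifiable=False),
    Claim("Placeholder claim D", verifiable=True),
]
print(f"hallucination_rate = {hallucination_rate(claims):.2f}")
```

Raising on an empty claim list, rather than returning 0, keeps an article with a failed extraction from looking like a perfect score.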

2. Consistency rate

Do articles contradict each other?

Method: After producing all 20 articles, extract claims that overlap in topic. Within each condition's set of 10 articles, check whether any two articles make contradictory statements about the same subject.

  • LLM-only: each article is generated independently, so contradictions are likely
  • LLM+Metis: all articles draw from the same knowledge graph, so contradictions should be flagged by Metis before they reach the article

Metric: contradiction_count per condition, across that condition's 10 articles
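
One way to tally this metric is sketched below. It assumes contradictions are counted separately for each condition, which is how the expected-outcomes table reports them. The `Contradiction` record and the condition labels `llm_only` and `llm_metis` are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Contradiction:
    """A contradiction the evaluator found between two articles (hypothetical shape)."""
    condition: str  # "llm_only" or "llm_metis"
    article_a: str
    article_b: str
    subject: str


def contradiction_counts(found: list[Contradiction]) -> Counter:
    """Count unique contradictions per condition.

    The pair (a, b) and the pair (b, a) on the same subject count once.
    """
    unique = {
        (c.condition, frozenset((c.article_a, c.article_b)), c.subject)
        for c in found
    }
    return Counter(condition for condition, _, _ in unique)


# Placeholder findings, not real evaluation data.
found = [
    Contradiction("llm_only", "article-01", "article-04", "subject-x"),
    Contradiction("llm_only", "article-04", "article-01", "subject-x"),  # duplicate pair
    Contradiction("llm_only", "article-02", "article-09", "subject-y"),
]
counts = contradiction_counts(found)
print(counts["llm_only"], counts["llm_metis"])
```

A `Counter` returns 0 for a condition with no recorded contradictions, so both conditions can be reported without special-casing.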

3. Production time

How long does it take to produce a publishable, cited article?

Method: Time the full workflow from "topic decided" to "article with citations ready for review."

  • LLM-only: prompt engineering + generation + manual fact-checking + manual citation
  • LLM+Metis: metis apply + chatbot composition (citations come free)

Metric: hours_per_article (include fact-checking time for LLM-only)
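
A minimal sketch of how the timing could be recorded so the two conditions stay comparable. The stage names mirror the workflow bullets above; the minute values are placeholders, not measurements.

```python
def hours_per_article(stage_minutes: dict[str, float]) -> float:
    """Sum the logged minutes for every workflow stage and convert to hours."""
    if any(minutes < 0 for minutes in stage_minutes.values()):
        raise ValueError("stage durations must be non-negative")
    return sum(stage_minutes.values()) / 60


# Placeholder timings for one topic, in minutes.
llm_only = {
    "prompt_engineering": 30,
    "generation": 15,
    "manual_fact_checking": 90,
    "manual_citation": 45,
}
llm_metis = {
    "metis_apply": 15,
    "chatbot_composition": 30,
}

print(f"LLM-only:  {hours_per_article(llm_only):.2f} h")
print(f"LLM+Metis: {hours_per_article(llm_metis):.2f} h")
```

Logging each stage separately, instead of one total, makes it visible whether any saving comes from fact-checking and citation or from somewhere else.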

Test design

Source material

Build a Metis project with 5-10 healthcare sources on child safety topics:

  • AAP (American Academy of Pediatrics) guidelines on infant sleep
  • WHO feeding recommendations
  • CDC immunization schedules
  • 2-3 peer-reviewed papers on common parenting safety topics
  • 1-2 parenting guidebooks (EPUB or PDF)

Total: ~5-10 sources, enough to create a meaningful knowledge graph.

Profile: strict — this is healthcare content. Full provenance required.

Article topics

10 topics, each written twice (once LLM-only, once LLM+Metis):

  1. Safe sleep practices for newborns
  2. When and how to introduce solid foods
  3. Common infant choking hazards and prevention
  4. Fever management in children under 2
  5. Car seat safety by age and weight
  6. Childhood vaccination schedule and common concerns
  7. Baby-proofing your home room by room
  8. Safe bathing practices for infants
  9. When to call the doctor: warning signs in infants
  10. Sun safety for babies and toddlers

Control conditions

  • LLM-only: The chatbot receives only the topic. No Metis context. It uses its training data.
  • LLM+Metis: The chatbot receives the topic + the full Metis apply output for a relevant query. Same chatbot, same model, same system prompt (except for the Metis context block).
  • Same model for both: Use the same LLM (e.g., Claude Sonnet or Kimi) so the only variable is whether Metis context is provided.
  • Same evaluator for both: The person scoring hallucinations and contradictions should not know which article is which (blind evaluation if possible).
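
The blind-evaluation condition can be approximated by shuffling the articles and replacing the condition label with an opaque ID before handing them to the evaluator. This is a sketch under assumptions: the dict keys (`topic`, `condition`, `text`) and the ID format are hypothetical.

```python
import random


def blind_articles(articles: list[dict], seed: int) -> tuple[list[dict], dict[str, str]]:
    """Shuffle articles and strip the condition label.

    Returns the blinded articles plus an answer key (blind ID -> condition)
    that is kept away from the evaluator until scoring is finished.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    order = list(range(len(articles)))
    rng.shuffle(order)

    blinded: list[dict] = []
    key: dict[str, str] = {}
    for position, index in enumerate(order, start=1):
        blind_id = f"A{position:02d}"
        article = articles[index]
        blinded.append({"id": blind_id, "topic": article["topic"], "text": article["text"]})
        key[blind_id] = article["condition"]
    return blinded, key


# Placeholder articles; the real set would hold 20.
articles = [
    {"topic": "topic-1", "condition": "llm_only", "text": "placeholder text"},
    {"topic": "topic-1", "condition": "llm_metis", "text": "placeholder text"},
    {"topic": "topic-2", "condition": "llm_only", "text": "placeholder text"},
    {"topic": "topic-2", "condition": "llm_metis", "text": "placeholder text"},
]
blinded, key = blind_articles(articles, seed=17)
print([article["id"] for article in blinded])
```

This hides only the label. If the article text itself carries Metis atom citations, the evaluator could still infer the condition from the text.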

Evaluation process

Phase 1: Build the knowledge base (~1-2 weeks)

  1. Collect 5-10 healthcare sources (EPUBs, PDFs once supported, markdown)
  2. Run metis learn with strict profile
  3. Verify atom count, quotedSpans coverage, entity graph quality
  4. Spot-check 10-20 atoms manually: are the quotes real? Are the claims accurate?
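
For the manual spot-check in step 4, drawing the atoms at random avoids only inspecting the ones that are easiest to find. A minimal sketch, assuming atom IDs have already been exported to a plain list (the ID format here is a placeholder):

```python
import random


def pick_spot_check(atom_ids: list[str], k: int, seed: int) -> list[str]:
    """Return a reproducible random sample of up to k atom IDs to review by hand."""
    rng = random.Random(seed)
    return sorted(rng.sample(atom_ids, min(k, len(atom_ids))))


# Placeholder IDs; the real list would come from the Metis project.
atom_ids = [f"atom-{i}" for i in range(200)]
sample = pick_spot_check(atom_ids, k=15, seed=17)
print(len(sample))
```

Recording the seed alongside the results lets someone else re-draw the same sample and repeat the check.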

Phase 2: Generate articles (~1 week)

  1. For each topic, run metis apply with a relevant query
  2. Feed the Metis output to the chatbot as context, generate the LLM+Metis article
  3. Generate the LLM-only article with the same chatbot, no Metis context
  4. Save all 20 articles with metadata (which is which, generation time, model used)
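
The metadata in step 4 could be stored as one JSON record per article. The sketch below is an assumption about a convenient shape, not a Metis format; the field names and the model string are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ArticleRecord:
    """Metadata saved with each generated article (hypothetical shape)."""
    topic: str
    condition: str  # "llm_only" or "llm_metis"
    model: str
    generation_seconds: float
    text: str


record = ArticleRecord(
    topic="Safe sleep practices for newborns",
    condition="llm_metis",
    model="placeholder-model",
    generation_seconds=0.0,  # fill in from the actual run
    text="placeholder article text",
)

# One JSON object per line keeps all 20 records in a single appendable file.
line = json.dumps(asdict(record))
print(line)
```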

Phase 3: Evaluate (~1-2 weeks)

  1. Extract claims from all 20 articles (aim for 15-30 claims per article)
  2. Score each claim: verifiable (with source) or unverifiable (hallucination)
  3. For LLM+Metis claims: verify the Metis provenance chain (atom → quotedSpan → source)
  4. Cross-check the 10 articles in each condition for contradictions
  5. Record time spent on each article including any fact-checking
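
The provenance check in step 3 (atom → quotedSpan → source) can be sketched as below. Only the `quotedSpan` name comes from this plan; the atom and claim dict shapes, the `atom_id` and `source` keys, and the status strings are assumptions for illustration. Matching is done after collapsing whitespace and case, which is also an assumption about how strict the check should be.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so line wraps do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def span_in_source(quoted_span: str, source_text: str) -> bool:
    """True if the quoted span occurs in the source text after normalization."""
    span = normalize(quoted_span)
    return bool(span) and span in normalize(source_text)


def verify_claim(claim: dict, atoms: dict[str, dict], sources: dict[str, str]) -> str:
    """Walk claim -> atom -> quotedSpan -> source and report where the chain breaks."""
    atom = atoms.get(claim.get("atom_id"))
    if atom is None:
        return "no_atom"
    span = atom.get("quotedSpan")
    if not span:
        return "no_quoted_span"
    source_text = sources.get(atom.get("source"))
    if source_text is None:
        return "missing_source"
    if not span_in_source(span, source_text):
        return "span_not_in_source"
    return "verified"


# Placeholder data, not real Metis output.
sources = {"source-1.md": "Placeholder   source text with a\nquoted passage inside."}
atoms = {
    "atom-1": {"quotedSpan": "a quoted passage", "source": "source-1.md"},
    "atom-2": {"quotedSpan": "text that is absent", "source": "source-1.md"},
}
for atom_id in ("atom-1", "atom-2", "atom-9"):
    print(atom_id, verify_claim({"atom_id": atom_id}, atoms, sources))
```

Returning a status string instead of a boolean records which link in the chain failed, which is useful when picking examples for the report.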

Phase 4: Report

  1. Compile metrics: hallucination rate, contradiction count, hours per article
  2. Include specific examples: "LLM-only said X about fever management, but the AAP actually says Y"
  3. Include the full provenance chain for 2-3 LLM+Metis claims as demonstration
  4. Publish findings (blog post, GitHub discussion, or paper)

Expected outcomes (hypotheses)

| Metric | LLM-only (expected) | LLM+Metis (expected) |
|---|---|---|
| Hallucination rate | 10-20% of claims | <5% (only in chatbot composition layer) |
| Contradictions across 10 articles | 3-8 | 0-1 |
| Hours per article (with fact-checking) | 2-4 hours | 0.5-1 hour |

If these hypotheses hold, the delta is publishable and sellable:

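  • At least 2-4x fewer hallucinations (10-20% of claims down to under 5%)
  • Near-zero contradictions
  • Roughly 4x faster production at the range midpoints (3 hours vs. 0.75 hours)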

Prerequisites

  • PDF parser for Metis (many healthcare sources are PDFs)
  • parentguidebook project set up with strict profile
  • 5-10 healthcare sources collected and vetted
  • Chatbot prompt template for both conditions (with/without Metis context)
  • Claim extraction methodology (manual or LLM-assisted)
  • Evaluator recruited (ideally someone with healthcare knowledge)

Definition of done

  • 20 articles generated (10 pairs)
  • All claims scored for verifiability
  • Contradiction analysis complete
  • Production time recorded
  • Results documented with specific examples
  • At least one blog post or report published with findings

Why this matters

This test produces the proof that turns "Metis adds rigor" from a claim into a measurement. If the numbers hold, this becomes the sales deck for every market we pursue: "In a blind test on healthcare content, LLM+Metis produced 4x fewer hallucinations and zero contradictions across 10 articles." That's not a vibes argument. That's data.
