Test plan: LLM-only vs LLM+Metis healthcare article quality comparison #17

## Objective

Test whether Metis-backed content is measurably better than LLM-only content for healthcare writing, using parentguidebook as the test bed. The goal is not a vibes comparison ("which reads better?") but a metrics-based evaluation that produces numbers we can publish.

## The three metrics

### 1. Hallucination rate
How many claims in the article have no backing in any source?

**Method:** For each article, extract every factual claim (manually or with an LLM evaluator). For each claim, check:
- LLM-only: can the claim be traced to a real, verifiable source? (Google it, check PubMed, etc.)
- LLM+Metis: does the claim cite a Metis atom? Does the atom have a verified quotedSpan? Does the quotedSpan exist in the source file?

**Metric:** `hallucination_rate = unverifiable_claims / total_claims`
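
The metric can be computed from a scored claim list. This is a minimal sketch, assuming each extracted claim has already been marked verifiable or not by the evaluator; the `Claim` type and the sample claims are hypothetical and not part of Metis.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    verifiable: bool  # True if the evaluator traced the claim to a source


def hallucination_rate(claims: list[Claim]) -> float:
    """Return unverifiable_claims / total_claims for one article."""
    if not claims:
        # With zero extracted claims the ratio is undefined, so fail loudly.
        raise ValueError("no claims extracted; rate is undefined")
    unverifiable = sum(1 for c in claims if not c.verifiable)
    return unverifiable / len(claims)


# Placeholder claims for illustration only (not real evaluation data).
claims = [
    Claim("claim A", True),
    Claim("claim B", False),
    Claim("claim C", True),
    Claim("claim D", True),
]
rate = hallucination_rate(claims)  # 1 unverifiable out of 4 claims
```

Computing the rate per article and then aggregating per condition keeps a single claim-heavy article from dominating the result.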

### 2. Consistency rate
Do articles contradict each other?

**Method:** After producing all 20 articles, extract claims that overlap in topic within each condition's set of 10 articles. Check whether any two articles in the same set make contradictory statements about the same subject.
- LLM-only: each article is generated independently, so contradictions are likely
- LLM+Metis: all articles draw from the same knowledge graph, so contradictions should be flagged by Metis before they reach the article

**Metric:** `contradiction_count`, reported per condition across its 10 articles
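
One way to make the count reproducible is to enumerate every pair of same-topic claims from different articles within one set and pass each pair to a judge. The sketch below assumes claims have already been tagged with a topic label; `TopicClaim` and the `contradicts` callback (a human decision or an LLM evaluator) are hypothetical names, and the toy judge only compares placeholder strings.

```python
from itertools import combinations
from typing import Callable, NamedTuple


class TopicClaim(NamedTuple):
    article_id: str
    topic: str
    text: str


def contradiction_count(
    claims: list[TopicClaim],
    contradicts: Callable[[TopicClaim, TopicClaim], bool],
) -> int:
    """Count contradictory claim pairs across different articles in one set."""
    count = 0
    for a, b in combinations(claims, 2):
        # Only compare claims on the same topic that come from different articles.
        if a.article_id != b.article_id and a.topic == b.topic and contradicts(a, b):
            count += 1
    return count


# Placeholder claims; the texts stand in for real extracted statements.
sample = [
    TopicClaim("article-1", "topic-1", "X"),
    TopicClaim("article-2", "topic-1", "Y"),
    TopicClaim("article-3", "topic-1", "X"),
    TopicClaim("article-2", "topic-2", "Z"),
]
# Toy judge for the example: any differing text counts as a contradiction.
n_contradictions = contradiction_count(sample, lambda a, b: a.text != b.text)
```

Counting pairs rather than articles means one outlier article can contribute several contradictions, so the report should state which unit is being counted.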

### 3. Production time
How long does it take to produce a publishable, cited article?

**Method:** Time the full workflow from "topic decided" to "article with citations ready for review."
- LLM-only: prompt engineering + generation + manual fact-checking + manual citation
- LLM+Metis: metis apply + chatbot composition (citations come free)

**Metric:** `hours_per_article` (include fact-checking time for LLM-only)
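
Because the work on one article is rarely contiguous, it may be easier to log minutes per workflow stage and sum them. A minimal sketch, assuming a per-article log keyed by stage name; the stage names and minute values are made up for illustration and are not measurements.

```python
def hours_per_article(stage_minutes: dict[str, float]) -> float:
    """Sum the logged minutes for every stage and convert to hours."""
    return sum(stage_minutes.values()) / 60


# Illustrative logs only: stage names and durations are assumptions.
llm_only_log = {"prompting": 30, "generation": 10, "fact_checking": 90, "citation": 50}
llm_metis_log = {"metis_apply": 15, "composition": 30}

llm_only_hours = hours_per_article(llm_only_log)
llm_metis_hours = hours_per_article(llm_metis_log)
```

Keeping the stages separate also shows where the time goes, for example how much of the LLM-only total is fact-checking.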

## Test design

### Source material
Build a Metis project with 5-10 healthcare sources on child safety topics:
- AAP (American Academy of Pediatrics) guidelines on infant sleep
- WHO feeding recommendations
- CDC immunization schedules
- 2-3 peer-reviewed papers on common parenting safety topics
- 1-2 parenting guidebooks (EPUB or PDF)

Total: ~5-10 sources, enough to create a meaningful knowledge graph.

**Profile:** `strict` — this is healthcare content. Full provenance required.

### Article topics
10 topics, each written twice (once LLM-only, once LLM+Metis):

1. Safe sleep practices for newborns
2. When and how to introduce solid foods
3. Common infant choking hazards and prevention
4. Fever management in children under 2
5. Car seat safety by age and weight
6. Childhood vaccination schedule and common concerns
7. Baby-proofing your home room by room
8. Safe bathing practices for infants
9. When to call the doctor: warning signs in infants
10. Sun safety for babies and toddlers

### Control conditions
- **LLM-only:** The chatbot receives only the topic. No Metis context. It uses its training data.
- **LLM+Metis:** The chatbot receives the topic + the full Metis apply output for a relevant query. Same chatbot, same model, same system prompt (except for the Metis context block).
- **Same model for both:** Use the same LLM (e.g., Claude Sonnet or Kimi) so the only variable is whether Metis context is provided.
- **Same evaluator for both:** The person scoring hallucinations and contradictions should not know which article is which (blind evaluation if possible).
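
Blinding can be approximated by shuffling the articles and handing the evaluator only opaque IDs, with the ID-to-condition key stored separately until scoring is finished. The sketch below is one possible approach; `blind_articles`, the `A01`-style IDs, and the condition labels are hypothetical.

```python
import random


def blind_articles(
    articles: dict[tuple[int, str], str], seed: int
) -> tuple[dict[str, str], dict[str, tuple[int, str]]]:
    """Shuffle articles and return (blinded texts by ID, secret key by ID)."""
    rng = random.Random(seed)  # seeded so the assignment can be reproduced later
    keys = list(articles)
    rng.shuffle(keys)
    blinded: dict[str, str] = {}
    key_map: dict[str, tuple[int, str]] = {}
    for i, key in enumerate(keys, start=1):
        blind_id = f"A{i:02d}"
        blinded[blind_id] = articles[key]
        key_map[blind_id] = key  # keep this mapping away from the evaluator
    return blinded, key_map


# 10 topics x 2 conditions, with placeholder bodies instead of real articles.
articles: dict[tuple[int, str], str] = {}
for topic in range(1, 11):
    for condition in ("llm_only", "llm_metis"):
        articles[(topic, condition)] = f"placeholder body #{len(articles) + 1}"

blinded, key_map = blind_articles(articles, seed=17)
```

Blinding is only partial if the article text itself reveals the condition, for instance through visible Metis citation markers, so the evaluator copy may need those normalized first.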

### Evaluation process

**Phase 1: Build the knowledge base (~1-2 weeks)**
1. Collect 5-10 healthcare sources (EPUBs, PDFs once supported, markdown)
2. Run `metis learn` with `strict` profile
3. Verify atom count, quotedSpans coverage, entity graph quality
4. Spot-check 10-20 atoms manually: are the quotes real? Are the claims accurate?

**Phase 2: Generate articles (~1 week)**
1. For each topic, run `metis apply` with a relevant query
2. Feed the Metis output to the chatbot as context, generate the LLM+Metis article
3. Generate the LLM-only article with the same chatbot, no Metis context
4. Save all 20 articles with metadata (which is which, generation time, model used)

**Phase 3: Evaluate (~1-2 weeks)**
1. Extract claims from all 20 articles (aim for 15-30 claims per article)
2. Score each claim: verifiable (with source) or unverifiable (hallucination)
3. For LLM+Metis claims: verify the Metis provenance chain (atom → quotedSpan → source)
4. Cross-check the 10 articles within each condition for contradictions
5. Record time spent on each article including any fact-checking
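
The last link of the provenance chain (quotedSpan → source) can be checked mechanically. This is a sketch under stated assumptions: how Metis stores atoms and source files is not described here, so the function works on plain strings, and whitespace normalization is an assumption to tolerate line wrapping in extracted text.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def span_in_source(quoted_span: str, source_text: str) -> bool:
    """Return True if the quoted span occurs verbatim in the source text."""
    span = normalize(quoted_span)
    # An empty span would trivially match any source, so treat it as a failure.
    return bool(span) and span in normalize(source_text)


# Placeholder source text, not taken from any real guideline.
source = "Line one of the\nsource document.  Line two."
found = span_in_source("the source document. Line", source)
missing = span_in_source("text that never appears", source)
```

An exact-match check like this only confirms that the quote exists; whether the claim is a fair reading of the quote still needs the evaluator.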

**Phase 4: Report**
1. Compile metrics: hallucination rate, contradiction count, hours per article
2. Include specific examples: "LLM-only said X about fever management, but the AAP actually says Y"
3. Include the full provenance chain for 2-3 LLM+Metis claims as demonstration
4. Publish findings (blog post, GitHub discussion, or paper)

## Expected outcomes (hypotheses)

| Metric | LLM-only (expected) | LLM+Metis (expected) |
|---|---|---|
| Hallucination rate | 10-20% of claims | <5% (only in chatbot composition layer) |
| Contradictions across 10 articles | 3-8 | 0-1 |
| Hours per article (with fact-checking) | 2-4 hours | 0.5-1 hour |

If these hypotheses hold, the delta is publishable and sellable:
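- **2-4x fewer hallucinations, or better** (10-20% vs. under 5%)
- **Near-zero contradictions** (0-1 vs. 3-8)
- **2-8x faster production, about 4x at the midpoints** (2-4 hours vs. 0.5-1 hour)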
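
The headline multipliers can be sanity-checked against the ranges in the table. The sketch below computes the smallest and largest fold difference implied by two ranges; `fold_range` is a hypothetical helper, and the hallucination comparison fixes the LLM+Metis rate at the 5% bound because "<5%" has no stated lower limit.

```python
def fold_range(
    baseline: tuple[float, float], treatment: tuple[float, float]
) -> tuple[float, float]:
    """Return (min, max) of baseline / treatment over the two ranges."""
    b_lo, b_hi = baseline
    t_lo, t_hi = treatment
    if t_lo <= 0:
        raise ValueError("treatment lower bound must be positive")
    # Smallest ratio: lowest baseline over highest treatment, and vice versa.
    return b_lo / t_hi, b_hi / t_lo


# Hours per article: 2-4 (LLM-only) vs. 0.5-1 (LLM+Metis), from the table.
hours_fold = fold_range((2.0, 4.0), (0.5, 1.0))

# Hallucination rate in percent: 10-20 vs. the 5% bound for LLM+Metis.
hallucination_fold_at_5pct = fold_range((10.0, 20.0), (5.0, 5.0))
```

Under these hypothesized ranges the speedup spans 2x to 8x, and the hallucination reduction is 2x to 4x at a 5% LLM+Metis rate, growing larger if the measured rate comes in lower.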

## Prerequisites

- [ ] PDF parser for Metis (many healthcare sources are PDFs)
- [ ] parentguidebook project set up with strict profile
- [ ] 5-10 healthcare sources collected and vetted
- [ ] Chatbot prompt template for both conditions (with/without Metis context)
- [ ] Claim extraction methodology (manual or LLM-assisted)
- [ ] Evaluator recruited (ideally someone with healthcare knowledge)

## Definition of done

- 20 articles generated (10 pairs)
- All claims scored for verifiability
- Contradiction analysis complete
- Production time recorded
- Results documented with specific examples
- At least one blog post or report published with findings

## Why this matters

This test produces the proof that turns "Metis adds rigor" from a claim into a measurement. If the numbers hold, this becomes the sales deck for every market we pursue: "In a blind test on healthcare content, LLM+Metis produced 4x fewer hallucinations and zero contradictions across 10 articles." That's not a vibes argument. That's data.
