Read this the first time you want to walk through every feature on one concrete step.
The README shows a minimal SummarizeArticle step. This guide walks through the features you reach for as production requirements grow: budget caps so runaway inputs don't drain your LLM provider budget, evals so you catch regressions in CI, and CI gating so a merge that lowers accuracy gets blocked.
Start with the README example, then add features one layer at a time. Each is optional — use what you need.
class SummarizeArticle < RubyLLM::Contract::Step::Base
  # 1. Prompt (required)
  prompt <<~PROMPT
    Summarize this article for a UI card. Return a short TL;DR,
    3 to 5 key takeaways, and a tone label.
    {input}
  PROMPT
  # 2. Schema: sent to the provider via with_schema, validated client-side
  output_schema do
    string :tldr
    array :takeaways, of: :string, min_items: 3, max_items: 5
    string :tone, enum: %w[neutral positive negative analytical]
  end
  # 3. Business rules: things JSON Schema cannot express
  validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
  validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }
  # 4. Retry with model fallback on validation_failed / parse_error
  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
  # 5. Refuse before calling the LLM if input is too large or estimated cost exceeds the cap
  max_input 2_000
  max_output 4_000
  max_cost 0.01
end

When the cheap model returns output that fails a validate block or can't be parsed, retry falls back to the next model in models: and tries again.
result = SummarizeArticle.run(article_text)
result.status          # => :ok
result.parsed_output   # => { tldr: "...", takeaways: [...], tone: "analytical" }
result.trace[:model]   # => "gpt-4.1-mini" (first model that passed)
result.trace[:cost]    # => 0.00052 (sum of all attempts)
result.trace[:attempts]
# => [
#   { attempt: 1, model: "gpt-4.1-nano", status: :validation_failed,
#     cost: 0.00010, latency_ms: 45, ... },
#   { attempt: 2, model: "gpt-4.1-mini", status: :ok,
#     cost: 0.00042, latency_ms: 92, ... }
# ]

If the whole chain exhausts, result.status is the status of the last attempt (:validation_failed or :parse_error) and result.parsed_output is the last attempt's output. The caller decides what to do: ship it anyway, fall back to a template, or raise.
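The fallback behaviour described above can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: run_with_fallback, the Attempt struct, VALIDATORS, and the CANNED responses are all hypothetical names, and the block passed in stands in for a real LLM call.

```ruby
# Hypothetical sketch of retry-with-fallback; not the gem's implementation.
Attempt = Struct.new(:attempt, :model, :status, :cost, :output, keyword_init: true)

# The block stands in for a real LLM call. It must return
# [parsed_output_or_nil, cost]; nil output is treated as a parse error.
def run_with_fallback(models, validators, &call_model)
  raise ArgumentError, "models must not be empty" if models.empty?

  attempts = []
  models.each_with_index do |model, index|
    output, cost = call_model.call(model)
    status =
      if output.nil?
        :parse_error
      elsif validators.all? { |_label, check| check.call(output) }
        :ok
      else
        :validation_failed
      end
    attempts << Attempt.new(attempt: index + 1, model: model, status: status,
                            cost: cost, output: output)
    break if status == :ok # stop at the first model that passes
  end

  last = attempts.last
  {
    status: last.status,          # status of the last attempt, even on exhaustion
    parsed_output: last.output,   # last attempt's output, valid or not
    model: last.model,
    cost: attempts.sum(&:cost),   # every attempt is billed
    attempts: attempts
  }
end

VALIDATORS = {
  "TL;DR fits the card" => ->(o) { o[:tldr].length <= 200 }
}.freeze

# Canned responses: the cheapest model overshoots the length limit.
CANNED = {
  "gpt-4.1-nano" => [{ tldr: "x" * 250 }, 0.0001],
  "gpt-4.1-mini" => [{ tldr: "Short summary." }, 0.00042]
}.freeze

result = run_with_fallback(%w[gpt-4.1-nano gpt-4.1-mini gpt-4.1], VALIDATORS) do |model|
  CANNED.fetch(model)
end

# One way a caller might handle exhaustion: fall back to a template.
summary = result[:status] == :ok ? result[:parsed_output] : { tldr: "Summary unavailable." }
```

Because the loop stops at the first passing model, gpt-4.1 is never called in this run, and the reported cost still includes the failed gpt-4.1-nano attempt.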
An eval is a named scenario you can run to verify the step still works. sample_response makes it offline — zero API calls — so CI can run it on every merge without burning budget.
SummarizeArticle.define_eval("smoke") do
  default_input <<~ARTICLE
    Ruby 3.4 ships with frozen string literals on by default, measurable YJIT
    speedups on Rails workloads, and tightened Warning.warn category filtering.
    The release notes also mention several parser fixes and faster keyword
    argument handling.
  ARTICLE
  sample_response({
    tldr: "Ruby 3.4 brings frozen string literals by default, YJIT speedups, and parser fixes.",
    takeaways: [
      "Frozen string literals are the default",
      "YJIT adds measurable speedups on Rails workloads",
      "Warning.warn category filtering is tighter"
    ],
    tone: "analytical"
  })
end
report = SummarizeArticle.run_eval("smoke")
report.passed?   # => true (schema + validates pass on the canned response)
report.score     # => 1.0
report.print_summary

For real regression testing, define cases with expected output (online, so this calls the LLM):
SummarizeArticle.define_eval("regression") do
  add_case "ruby release",
    input: "Ruby 3.4 was released...",
    expected: { tone: "analytical" } # partial match
  add_case "critical review",
    input: "The new mesh networking hardware failed under load...",
    expected: { tone: "negative" }
end

Gate CI on score and cost thresholds:
# RSpec: blocks merge if accuracy drops or cost spikes
expect(SummarizeArticle).to pass_eval("regression")
  .with_minimum_score(0.8)
  .with_maximum_cost(0.01)

Save a baseline once, then block regressions automatically:
report = SummarizeArticle.run_eval("regression")
report.save_baseline!
# In CI:
expect(SummarizeArticle).to pass_eval("regression").without_regressions

without_regressions fails the build only if a previously-passing case now fails, for example after a new model version, a prompt tweak, or an upstream change that silently lowered quality.
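To make the partial-match, score, and regression semantics concrete, here is a small plain-Ruby sketch. The helper names (partial_match?, score, regressions) are hypothetical and the pass/fail hashes are made-up data; how the gem stores baselines or computes scores internally is not covered above, so treat this as an illustration of the idea only.

```ruby
# Hypothetical helpers; names are illustrative, not the gem's API.

# A case passes when every expected key matches.
# Extra keys in the output (tldr, takeaways) are ignored.
def partial_match?(expected, output)
  expected.all? { |key, value| output[key] == value }
end

# Score is the fraction of passing cases (0.0 for an empty run).
def score(results)
  return 0.0 if results.empty?

  results.values.count(true).fdiv(results.size)
end

# A regression is a case that passed in the baseline and fails now.
# Cases that were already failing, or that are missing from the
# current run, are not counted in this sketch.
def regressions(baseline, current)
  baseline.select { |name, passed| passed && current[name] == false }.keys
end

output = { tldr: "...", takeaways: %w[a b c], tone: "analytical" }
partial_match?({ tone: "analytical" }, output) # => true
partial_match?({ tone: "negative" }, output)   # => false

baseline = { "ruby release" => true, "critical review" => true }
current  = { "ruby release" => true, "critical review" => false }

score(current)                 # => 0.5
regressions(baseline, current) # => ["critical review"]
```

Note the difference between the two gates: a minimum score looks only at the current run, while a regression check compares case by case against the saved baseline, so it can fail even when the overall score is still above the threshold.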
max_input, max_output, and max_cost are preflight checks — the LLM is never called if an estimate exceeds the limit. Zero tokens spent, zero cost.
result = SummarizeArticle.run(giant_10mb_document)
result.status # => :limit_exceeded
result.validation_errors
# => ["Input token limit exceeded: estimated 32000 tokens (heuristic ±30%), max 2000"]

max_cost fails closed when the model's pricing isn't known, so register custom or fine-tuned models explicitly:
RubyLLM::Contract::CostCalculator.register_model("ft:gpt-4o-custom",
  input_per_1m: 3.0, output_per_1m: 6.0)

Or opt into a soft warning instead of a refusal when pricing is missing:

max_cost 0.01, on_unknown_pricing: :warn

Default is :refuse. Use :warn only when you accept running without a cost ceiling (fine-tuned models you trust, private endpoints).
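A preflight check of this kind can be sketched as follows. Everything here is an assumption made for illustration: the four-characters-per-token heuristic, the PRICING_PER_1M table, the preflight method, and the choice to report :limit_exceeded for a missing-pricing refusal (the text above does not say which status the gem uses in that case).

```ruby
# Hypothetical preflight sketch; heuristic, table, and names are assumptions.
PRICING_PER_1M = {
  "ft:gpt-4o-custom" => { input: 3.0, output: 6.0 } # USD per 1M tokens
}.freeze

# Rough token estimate: about 4 characters per token (assumed heuristic).
def estimate_tokens(text)
  (text.length / 4.0).ceil
end

def preflight(input, model:, max_input:, max_output:, max_cost:, on_unknown_pricing: :refuse)
  tokens = estimate_tokens(input)
  if tokens > max_input
    return { status: :limit_exceeded,
             errors: ["Input token limit exceeded: estimated #{tokens} tokens, max #{max_input}"] }
  end

  price = PRICING_PER_1M[model]
  if price.nil?
    # Fail closed by default: no pricing means no way to enforce the cap.
    if on_unknown_pricing == :refuse
      return { status: :limit_exceeded, errors: ["No pricing registered for #{model}"] }
    end
    return { status: :ok, warnings: ["No pricing registered for #{model}; no cost ceiling applied"] }
  end

  # Worst case: the model uses the entire output budget.
  cost = (tokens * price[:input] + max_output * price[:output]) / 1_000_000.0
  if cost > max_cost
    return { status: :limit_exceeded, errors: ["Estimated cost #{cost} exceeds max #{max_cost}"] }
  end

  { status: :ok, warnings: [] }
end

limits = { max_input: 2_000, max_output: 4_000, max_cost: 0.05 }
preflight("a" * 128_000, model: "ft:gpt-4o-custom", **limits)[:status]              # => :limit_exceeded
preflight("short article", model: "my-private-model", **limits)[:status]            # => :limit_exceeded
preflight("short article", model: "my-private-model",
          on_unknown_pricing: :warn, **limits)[:status]                             # => :ok
```

All three checks run before any network call, which is why a refusal costs nothing. The :warn branch shows the trade-off: the call proceeds, but nothing bounds what it may cost.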
Check what a call is likely to cost before invoking it:
SummarizeArticle.estimate_cost(input: article_text)
# => {
#   model: "gpt-4.1-mini",
#   input_tokens: 812, output_tokens_estimate: 4000,
#   estimated_cost: 0.00243
# }
# Estimate what a full eval would cost across candidate models
SummarizeArticle.estimate_eval_cost("regression",
  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
# => { "gpt-4.1-nano" => 0.00041, "gpt-4.1-mini" => 0.0018, "gpt-4.1" => 0.0092 }

estimate_cost returns nil when pricing isn't registered. estimate_eval_cost silently treats unknown-pricing cases as $0.00 and sums the rest; it does not fail closed the way max_cost does. Treat its output as a floor, not a guarantee, and register pricing via CostCalculator.register_model before relying on it for budget decisions.
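The reason such a sum is only a floor is easy to see in a small sketch. The helpers call_cost and eval_cost_floor, the case data, and the model name my-private-model are hypothetical. The per-1M pricing for ft:gpt-4o-custom reuses the figures registered earlier.

```ruby
# Hypothetical sketch of a cost sum that skips unpriced models.
PRICING = {
  "ft:gpt-4o-custom" => { input_per_1m: 3.0, output_per_1m: 6.0 }
}.freeze

# Returns nil when pricing is unknown, mirroring estimate_cost's behaviour.
def call_cost(input_tokens, output_tokens, price)
  return nil if price.nil?

  (input_tokens * price[:input_per_1m] + output_tokens * price[:output_per_1m]) / 1_000_000.0
end

# Sums the known costs and counts unknown ones as 0.0, but also reports
# which models were unpriced so the caller can tell the total is a floor.
def eval_cost_floor(cases, pricing)
  unpriced = []
  total = cases.sum(0.0) do |c|
    cost = call_cost(c[:input_tokens], c[:output_tokens], pricing[c[:model]])
    unpriced << c[:model] if cost.nil?
    cost || 0.0
  end
  { total: total, unpriced_models: unpriced.uniq }
end

cases = [
  { model: "ft:gpt-4o-custom", input_tokens: 1_000, output_tokens: 500 },
  { model: "my-private-model", input_tokens: 1_000, output_tokens: 500 }
]

estimate = eval_cost_floor(cases, PRICING)
estimate[:unpriced_models] # => ["my-private-model"]
```

Here the total covers only the first case, even though the second case does identical work. Returning the list of unpriced models alongside the total is one way a wrapper could make the gap visible before a budget decision.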
with_schema in ruby_llm tells the provider to force a specific JSON structure. output_schema in this gem does the same thing (calls with_schema under the hood) plus validates the response client-side. Cheaper models sometimes ignore schema constraints — with_schema is a request; output_schema is a request plus verification.
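The "plus verification" half can be pictured as a check run on the raw response after it comes back. The sketch below hand-codes the SummarizeArticle shape in plain Ruby using only the standard json library. schema_errors and ALLOWED_TONES are hypothetical names, and this is not how the gem validates internally.

```ruby
require "json"

# Hypothetical hand-written checker for the SummarizeArticle shape.
ALLOWED_TONES = %w[neutral positive negative analytical].freeze

# Returns a list of problems; an empty list means the response conforms.
def schema_errors(raw_json)
  data = JSON.parse(raw_json)
  return ["response is not a JSON object"] unless data.is_a?(Hash)

  errors = []
  errors << "tldr must be a string" unless data["tldr"].is_a?(String)

  takeaways = data["takeaways"]
  if takeaways.is_a?(Array) && takeaways.all? { |t| t.is_a?(String) }
    errors << "takeaways must have 3 to 5 items" unless (3..5).cover?(takeaways.size)
  else
    errors << "takeaways must be an array of strings"
  end

  unless ALLOWED_TONES.include?(data["tone"])
    errors << "tone must be one of #{ALLOWED_TONES.join(', ')}"
  end
  errors
rescue JSON::ParserError
  ["response is not valid JSON"]
end

good = '{"tldr":"Short.","takeaways":["a","b","c"],"tone":"analytical"}'
bad  = '{"tldr":"Short.","takeaways":["a","b"],"tone":"sarcastic"}'

schema_errors(good)       # => []
schema_errors(bad).size   # => 2
schema_errors("not json") # => ["response is not valid JSON"]
```

A response like bad is what a model that ignores the requested schema might return. Without a client-side check it would flow straight into the UI, whereas with one it can become a :validation_failed attempt that triggers the fallback.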
- Prompt AST: prompt DSL variants (system, rule, section, example, user) and dynamic prompts with |input|.
- Eval-First: datasets, baselines, A/B gates, the workflow that makes the above evals useful.
- Optimizing retry_policy: find the cheapest viable fallback list with compare_models and optimize_retry_policy.
- Testing: test adapter, stub_step, full RSpec + Minitest matcher reference.
- Output Schema: nested objects in arrays, constraints, pattern reference.
- Rails integration: where step classes live, initializer, jobs, logging, specs, CI gate.