Getting Started

Read this on your first pass: it walks through every feature using one concrete step.

The README shows a minimal SummarizeArticle step. This guide walks through the features you reach for as production requirements grow: budget caps so runaway inputs don't drain your LLM provider budget, evals so you catch regressions in CI, and CI gating so a merge that lowers accuracy gets blocked.

The walkthrough

Start with the README example, then add features one layer at a time. Each is optional — use what you need.

class SummarizeArticle < RubyLLM::Contract::Step::Base
  # 1. Prompt (required)
  prompt <<~PROMPT
    Summarize this article for a UI card. Return a short TL;DR,
    3 to 5 key takeaways, and a tone label.

    {input}
  PROMPT

  # 2. Schema — sent to the provider via with_schema, validated client-side
  output_schema do
    string :tldr
    array  :takeaways, of: :string, min_items: 3, max_items: 5
    string :tone, enum: %w[neutral positive negative analytical]
  end

  # 3. Business rules — things JSON Schema cannot express
  validate("TL;DR fits the card")  { |o, _| o[:tldr].length <= 200 }
  validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }

  # 4. Retry with model fallback on validation_failed / parse_error
  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]

  # 5. Refuse before calling the LLM if input is too large or estimated cost exceeds the cap
  max_input  2_000
  max_output 4_000
  max_cost   0.01
end

Validation and retry behavior

When the cheap model returns output that fails a validate block or can't be parsed, retry falls back to the next model in models: and tries again.

result = SummarizeArticle.run(article_text)

result.status           # => :ok
result.parsed_output    # => { tldr: "...", takeaways: [...], tone: "analytical" }
result.trace[:model]    # => "gpt-4.1-mini"  (first model that passed)
result.trace[:cost]     # => 0.00052  (sum of all attempts)
result.trace[:attempts]
# => [
#   { attempt: 1, model: "gpt-4.1-nano", status: :validation_failed,
#     cost: 0.00010, latency_ms: 45, ... },
#   { attempt: 2, model: "gpt-4.1-mini", status: :ok,
#     cost: 0.00042, latency_ms: 92, ... }
# ]
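
The fallback behavior can be pictured as a plain loop. The sketch below is not the gem's implementation: `run_with_fallback`, `call_model`, and `valid` are hypothetical names, and the stubbed lambda stands in for a real provider call. It only illustrates "try the next model until one passes".

```ruby
# Tries each model in order and stops at the first output that passes validation.
# call_model is a hypothetical callable: (model, input) -> output hash.
# valid is a hypothetical callable: (output) -> true/false.
def run_with_fallback(models, input, call_model:, valid:)
  raise ArgumentError, "models must not be empty" if models.empty?

  attempts = []
  output = nil
  models.each_with_index do |model, index|
    output = call_model.call(model, input)
    status = valid.call(output) ? :ok : :validation_failed
    attempts << { attempt: index + 1, model: model, status: status }
    break if status == :ok
  end

  # When every model fails, status and output are those of the last attempt.
  { status: attempts.last[:status], output: output, attempts: attempts }
end

# Stub: the cheapest model returns an over-long TL;DR, the others behave.
stub = ->(model, _input) { { tldr: model == "gpt-4.1-nano" ? "x" * 300 : "Short summary." } }
fits = ->(output) { output[:tldr].length <= 200 }

result = run_with_fallback(%w[gpt-4.1-nano gpt-4.1-mini gpt-4.1], "article",
                           call_model: stub, valid: fits)
result[:status]                          # => :ok
result[:attempts].map { |a| a[:model] }  # => ["gpt-4.1-nano", "gpt-4.1-mini"]
```

Ordering the list from cheapest to most capable means the expensive model is only paid for when the cheaper ones fail.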

If the whole chain is exhausted, result.status is the status of the last attempt (:validation_failed or :parse_error) and result.parsed_output is the last attempt's output. The caller decides what to do: ship it anyway, fall back to a template, or raise.
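
One way to make that decision explicit is a small dispatcher on the status. This is a sketch under assumptions: `Result` is a stand-in Struct exposing only the two readers used above, and `card_for` and `FALLBACK_CARD` are hypothetical names, not part of the gem.

```ruby
# Stand-in for the result object; only the readers used in this guide.
Result = Struct.new(:status, :parsed_output, keyword_init: true)

# Hypothetical template shown when no attempt produced usable output.
FALLBACK_CARD = { tldr: "Summary unavailable.", takeaways: [], tone: "neutral" }.freeze

# Returns the hash to render, or raises on a status this caller does not know.
def card_for(result)
  case result.status
  when :ok
    result.parsed_output
  when :validation_failed, :parse_error, :limit_exceeded
    FALLBACK_CARD # no trustworthy output: fall back to the template
  else
    raise ArgumentError, "unexpected status: #{result.status.inspect}"
  end
end

ok     = Result.new(status: :ok,
                    parsed_output: { tldr: "Short.", takeaways: %w[a b c], tone: "neutral" })
failed = Result.new(status: :validation_failed, parsed_output: { tldr: "x" * 500 })

card_for(ok)[:tldr]     # => "Short."
card_for(failed)[:tldr] # => "Summary unavailable."
```

Raising on unknown statuses keeps a newly introduced status from being silently rendered as a valid card.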

Evals and CI gates

An eval is a named scenario you can run to verify the step still works. sample_response makes it offline — zero API calls — so CI can run it on every merge without burning budget.

SummarizeArticle.define_eval("smoke") do
  default_input <<~ARTICLE
    Ruby 3.4 ships with frozen string literals on by default, measurable YJIT
    speedups on Rails workloads, and tightened Warning.warn category filtering.
    The release notes also mention several parser fixes and faster keyword
    argument handling.
  ARTICLE

  sample_response({
    tldr: "Ruby 3.4 brings frozen string literals by default, YJIT speedups, and parser fixes.",
    takeaways: [
      "Frozen string literals are the default",
      "YJIT adds measurable speedups on Rails workloads",
      "Warning.warn category filtering is tighter"
    ],
    tone: "analytical"
  })
end

report = SummarizeArticle.run_eval("smoke")
report.passed?  # => true — schema + validates pass on the canned response
report.score    # => 1.0
report.print_summary

For real regression testing, define cases with expected output (online — calls the LLM):

SummarizeArticle.define_eval("regression") do
  add_case "ruby release",
           input: "Ruby 3.4 was released...",
           expected: { tone: "analytical" }  # partial match

  add_case "critical review",
           input: "The new mesh networking hardware failed under load...",
           expected: { tone: "negative" }
end
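
"Partial match" here presumably means that only the keys listed in expected are compared and every other key in the output is ignored. The helper below is a hypothetical illustration of that reading, not the gem's matcher.

```ruby
# Hypothetical helper: true when every expected key is present in actual
# with an equal value. Keys that appear only in actual are ignored.
def partial_match?(expected, actual)
  expected.all? { |key, value| actual.key?(key) && actual[key] == value }
end

actual = { tldr: "Ruby 3.4 is out.", takeaways: %w[a b c], tone: "analytical" }

partial_match?({ tone: "analytical" }, actual)  # => true
partial_match?({ tone: "negative" }, actual)    # => false
```

Asserting only on tone keeps the case stable when the model rewords the TL;DR between runs.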

Gate CI on score and cost thresholds:

# RSpec — blocks merge if accuracy drops or cost spikes
expect(SummarizeArticle).to pass_eval("regression")
  .with_minimum_score(0.8)
  .with_maximum_cost(0.01)

Save a baseline once, then block regressions automatically:

report = SummarizeArticle.run_eval("regression")
report.save_baseline!

# In CI:
expect(SummarizeArticle).to pass_eval("regression").without_regressions

without_regressions fails the build only if a previously passing case now fails, for example after a new model version, a prompt tweak, or an upstream change silently lowered quality.
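
The comparison rule can be sketched in a few lines. The data shape (case name mapped to pass or fail) and the `regressions` helper are assumptions for illustration; the gem's baseline format is not shown in this guide.

```ruby
# Hypothetical shape: case name => whether the case passed.
# A regression is a case that passed in the baseline and fails now.
def regressions(baseline, current)
  baseline.select { |name, passed| passed && current[name] == false }.keys
end

baseline = { "ruby release" => true,  "critical review" => false }
current  = { "ruby release" => false, "critical review" => false }

regressions(baseline, current)  # => ["ruby release"]
```

Under this rule "critical review" does not block the build: it was already failing when the baseline was saved, so it is a known gap and not a regression.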

Budget caps

max_input, max_output, and max_cost are preflight checks — the LLM is never called if an estimate exceeds the limit. Zero tokens spent, zero cost.

result = SummarizeArticle.run(giant_10mb_document)
result.status            # => :limit_exceeded
result.validation_errors
# => ["Input token limit exceeded: estimated 32000 tokens (heuristic ±30%), max 2000"]
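
A preflight check of this kind needs only the input text and a token estimate. The sketch below assumes the common rule of thumb of about four characters per token for English text; the gem's actual heuristic is not documented here, and `estimate_tokens` and `preflight` are hypothetical names.

```ruby
# Assumption: roughly 4 characters per token. Not the gem's heuristic.
CHARS_PER_TOKEN = 4.0

def estimate_tokens(text)
  (text.length / CHARS_PER_TOKEN).ceil
end

# Returns [:ok, nil] or [:limit_exceeded, message] without calling any provider.
def preflight(text, max_input:)
  estimated = estimate_tokens(text)
  return [:ok, nil] if estimated <= max_input

  [:limit_exceeded,
   "Input token limit exceeded: estimated #{estimated} tokens, max #{max_input}"]
end

preflight("a" * 400, max_input: 2_000)
# => [:ok, nil]
preflight("a" * 128_000, max_input: 2_000)
# => [:limit_exceeded, "Input token limit exceeded: estimated 32000 tokens, max 2000"]
```

Because the estimate is a heuristic, leave headroom in the limit if inputs near the boundary must not be refused.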

max_cost fails closed when the model's pricing isn't known — register custom or fine-tuned models explicitly:

RubyLLM::Contract::CostCalculator.register_model("ft:gpt-4o-custom",
  input_per_1m: 3.0, output_per_1m: 6.0)

Or opt into a soft warning instead of a refusal when pricing is missing:

max_cost 0.01, on_unknown_pricing: :warn

Default is :refuse. Use :warn only when you accept running without a cost ceiling (fine-tuned models you trust, private endpoints).
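
The difference between the two modes comes down to what happens when the price lookup returns nothing. The following is a minimal sketch of that decision, with a hypothetical `PRICING` table and `cost_gate` helper; it reuses the per-1M-token rates from the register_model example and is not the gem's code.

```ruby
# Hypothetical pricing table, USD per 1M tokens.
PRICING = { "ft:gpt-4o-custom" => { input_per_1m: 3.0, output_per_1m: 6.0 } }.freeze

# Returns :proceed or :refuse for a planned call.
def cost_gate(model, input_tokens:, output_tokens:, max_cost:, on_unknown_pricing: :refuse)
  price = PRICING[model]
  if price.nil?
    return :refuse if on_unknown_pricing == :refuse # fail closed

    warn "no pricing for #{model}; running without a cost ceiling"
    return :proceed
  end

  cost = (input_tokens * price[:input_per_1m] +
          output_tokens * price[:output_per_1m]) / 1_000_000.0
  cost <= max_cost ? :proceed : :refuse
end

cost_gate("ft:gpt-4o-custom", input_tokens: 800, output_tokens: 1_000, max_cost: 0.01)
# => :proceed  (estimated 0.0084 USD)
cost_gate("unlisted-model", input_tokens: 800, output_tokens: 1_000, max_cost: 0.01)
# => :refuse
cost_gate("unlisted-model", input_tokens: 800, output_tokens: 1_000, max_cost: 0.01,
          on_unknown_pricing: :warn)
# => :proceed
```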

Preflight cost estimates

Check what a call is likely to cost before invoking it:

SummarizeArticle.estimate_cost(input: article_text)
# => {
#      model: "gpt-4.1-mini",
#      input_tokens: 812, output_tokens_estimate: 4000,
#      estimated_cost: 0.00243
#    }

# Estimate what a full eval would cost across candidate models
SummarizeArticle.estimate_eval_cost("regression",
  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
# => { "gpt-4.1-nano" => 0.00041, "gpt-4.1-mini" => 0.0018, "gpt-4.1" => 0.0092 }

estimate_cost returns nil when pricing isn't registered. estimate_eval_cost silently treats unknown-pricing cases as $0.00 and sums the rest — it does not fail closed the way max_cost does. Treat its output as a floor, not a guarantee; register pricing via CostCalculator.register_model before relying on it for budget decisions.
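
To see why a sum with unpriced cases is only a floor, consider per-case estimates where nil marks a case without registered pricing. The data shape and the `summarize_estimates` helper are hypothetical; the sketch shows one way to surface the gap before trusting the total.

```ruby
# Hypothetical input: per-case cost estimates in USD, nil when pricing is unknown.
# Returns the sum of the known estimates plus a count of cases that were skipped.
def summarize_estimates(estimates)
  known = estimates.compact
  { floor: known.sum, unpriced_cases: estimates.size - known.size }
end

summary = summarize_estimates([0.0004, nil, 0.0018])
summary[:unpriced_cases]  # => 1

# Only treat the total as a budget figure when nothing was skipped.
puts "estimate is a lower bound" if summary[:unpriced_cases].positive?
```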

output_schema vs with_schema

with_schema in ruby_llm tells the provider to force a specific JSON structure. output_schema in this gem does the same thing (calls with_schema under the hood) plus validates the response client-side. Cheaper models sometimes ignore schema constraints — with_schema is a request; output_schema is a request plus verification.
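
The "verification" half can be illustrated with plain Ruby checks that mirror the constraints of the SummarizeArticle schema. This is a hand-written sketch, not how the gem validates; `schema_violations` and `TONES` are hypothetical names.

```ruby
TONES = %w[neutral positive negative analytical].freeze

# Returns a list of violations; an empty list means the output meets the constraints.
def schema_violations(output)
  errors = []
  errors << "tldr must be a string" unless output[:tldr].is_a?(String)

  takeaways = output[:takeaways]
  if takeaways.is_a?(Array) && takeaways.all? { |t| t.is_a?(String) }
    errors << "takeaways must have 3 to 5 items" unless (3..5).cover?(takeaways.size)
  else
    errors << "takeaways must be an array of strings"
  end

  errors << "tone must be one of #{TONES.join(', ')}" unless TONES.include?(output[:tone])
  errors
end

schema_violations({ tldr: "ok", takeaways: %w[a b c], tone: "analytical" })
# => []
schema_violations({ tldr: "ok", takeaways: %w[a b], tone: "sarcastic" })
# => ["takeaways must have 3 to 5 items",
#     "tone must be one of neutral, positive, negative, analytical"]
```

A response like the second one is what a provider-side schema request alone would let through if the model ignored the constraints.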

See also

  • Prompt AST — prompt DSL variants: system, rule, section, example, user, and dynamic prompts with |input|.
  • Eval-First — datasets, baselines, A/B gates, the workflow that makes the above evals useful.
  • Optimizing retry_policy — find the cheapest viable fallback list with compare_models and optimize_retry_policy.
  • Testing — test adapter, stub_step, full RSpec + Minitest matcher reference.
  • Output Schema — nested objects in arrays, constraints, pattern reference.
  • Rails integration — where step classes live, initializer, jobs, logging, specs, CI gate.