# Geval

**Eval-driven release decisions for AI systems**
Quick Start • Features • Documentation • Examples • Contributing
Geval is a core library for eval-driven release decisions in AI systems.
Modern AI teams run evals, inspect traces, and review results — but release decisions are still implicit, manual, and fragmented across CI scripts, Slack threads, and human judgment.
Geval exists to make those decisions explicit, reviewable, and enforceable.
It consumes evaluation results and other decision signals you already trust, applies decision contracts, and produces a deterministic outcome:
- ✅ allowed to ship
- ⚠️ allowed only with explicit human approval
- ❌ blocked from release
CI/CD is where enforcement happens — but the real value of Geval is creating a single, auditable decision layer between evals and production.
Eval Signals + Human Context → Geval Contract → Release Decision Record
Teams adopt Geval when:
- evals are directional, not absolute
- humans are always in the loop
- "half-baked" AI changes keep slipping through
- retros keep asking "why did we ship this?"
- custom CI scripts start to rot
Geval does not try to improve eval quality.
It removes decision ambiguity.
Geval is designed for:
- AI platform and infra engineers
- Teams with real production blast radius
- Systems where AI behavior affects users, safety, or revenue
- Teams tired of informal release decisions
Geval is not a fit for:
- exploration-only or research workflows
- teams happy with "we'll monitor it"
- users looking for dashboards or insights
- anyone expecting fully automated truth from evals
Geval is:
- ✅ Eval-agnostic – Consumes outputs from Promptfoo, LangSmith exports, OpenEvals, or custom tooling
- ✅ Signal-driven – Works with eval metrics, human review decisions, risk flags, and external references
- ✅ Contract-based – Encodes release intent and tolerance explicitly as code
- ✅ Decision-centric – Produces PASS / REQUIRE_APPROVAL / BLOCK outcomes with clear reasons
- ✅ CI-native (but not CI-limited) – Designed for CI enforcement, applicable anywhere release decisions are made
- ✅ Open-source core – All decision logic and contracts are inspectable and deterministic
Geval is not:
- ❌ An eval runner – it never executes evals or workflows
- ❌ An observability or tracing tool – it does not replace Phoenix or LangSmith
- ❌ A monitoring system – it does not watch production traffic
- ❌ A testing framework – it validates existing results, not tests
- ❌ A quality oracle – it does not define correctness or universal benchmarks
Those tools answer:
"What happened?"
Geval answers:
"Given what we knew at the time, was this allowed to ship?"
## Quick Start

```bash
# Install CLI globally
npm install -g @geval-labs/cli

# Or install as a dev dependency
npm install --save-dev @geval-labs/cli
```

Create a `contract.yaml` file:
```yaml
version: 1
name: quality-gate
description: Quality requirements for production deployment
environment: production

# Define how to parse CSV files
sources:
  csv:
    metrics:
      - column: accuracy
        aggregate: avg
      - column: latency_ms
        aggregate: p95
        as: latency_p95
      - column: error_rate
        aggregate: avg
    evalName:
      fixed: quality-metrics

# Define quality requirements
required_evals:
  - name: quality-metrics
    rules:
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.85
        description: Accuracy must be at least 85%
      - metric: latency_p95
        operator: "<="
        baseline: fixed
        threshold: 200
        description: P95 latency must be under 200ms
      - metric: error_rate
        operator: "<="
        baseline: fixed
        threshold: 0.01
        description: Error rate must be under 1%
    on_violation:
      action: block
      message: "Quality metrics did not meet production requirements"
```
Check the results against the contract:

```bash
# Check eval results against contract
geval check --contract contract.yaml --eval results.csv

# With baseline comparison
geval check --contract contract.yaml --eval results.csv --baseline baseline.json

# JSON output for CI/CD
geval check --contract contract.yaml --eval results.csv --json
```

**Pass:**
```text
✓ PASS
Contract: quality-gate
Version: 1

All 1 eval(s) passed contract requirements
```
**Block:**

```text
✗ BLOCK
Contract: quality-gate
Version: 1

Blocked: 2 violation(s) in 1 eval

Violations

1. quality-metrics → accuracy
   accuracy = 0.72, expected >= 0.85
   Actual: 0.72 | Baseline: 0.85

2. quality-metrics → latency_p95
   latency_p95 = 250, expected <= 200
   Actual: 250 | Baseline: 200
```
Geval supports multiple eval result formats:
- CSV - Raw data exports from any tool
- JSON - Normalized eval results or tool-specific formats
- JSONL - Line-delimited JSON files
Define CSV parsing directly in your contract - no separate config files needed:

```yaml
sources:
  csv:
    metrics:
      - column: accuracy
        aggregate: avg
      - column: latency
        aggregate: p95
        as: latency_p95
    evalName:
      fixed: my-eval
```

Compare against:
- Fixed thresholds - Absolute quality gates
- Previous runs - Detect regressions (see the sketch after this list)
- Main branch - Compare against the production baseline
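For instance, a regression gate against a previous run could look like the sketch below. This assumes the rule-level field is spelled `maxDelta`, as listed in the baseline types table further down; pair it with `geval check ... --baseline baseline.json`:

```yaml
required_evals:
  - name: quality-metrics
    rules:
      - metric: accuracy
        operator: ">="
        baseline: previous   # compare against the previous run
        maxDelta: 0.02       # assumed key name: allow at most a 0.02 drop
    on_violation:
      action: require_approval
```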
Built-in adapters for popular eval tools:
- Promptfoo - Automatic format detection
- LangSmith - CSV export support
- OpenEvals - JSON format support
- Generic - Custom JSON formats
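Adapter detection is automatic where the format is recognizable; the `--adapter` flag on `geval check` (documented below) forces a specific one:

```bash
# Force the Promptfoo adapter instead of relying on auto-detection
geval check --contract contract.yaml --eval promptfoo-results.json --adapter promptfoo
```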
Geval now supports policy-based contracts that can evaluate signals beyond just eval metrics:
- Signal types: `eval`, `human_review`, `risk_flag`, `external_reference`
- Signal conditions: match signals by type, name, or value
- Environment-specific rules: different policies for dev/staging/prod
- Human decisions: record approvals/rejections as signals
- Exit codes: `0` PASS · `1` BLOCK · `2` REQUIRES_APPROVAL · `3` ERROR
- JSON output: `--json` flag for programmatic use
- Multiple files: support for multiple eval result files
- Decision records: automatic generation of decision artifacts with hashes
## CLI Commands

### `geval check`

Evaluate contracts against eval results and enforce decisions.

```bash
geval check --contract <path> --eval <paths...> [options]
```

Options:

- `-c, --contract <path>` - Path to eval contract (YAML/JSON) (required)
- `-e, --eval <paths...>` - Path(s) to eval result files (CSV/JSON/JSONL) (required)
- `-b, --baseline <path>` - Path to baseline eval results for comparison
- `--adapter <name>` - Force a specific adapter (promptfoo, langsmith, openevals, generic)
- `--json` - Output results as JSON
- `--no-color` - Disable colored output
- `--verbose` - Show detailed output
Examples:

```bash
# Basic check
geval check -c contract.yaml -e results.csv

# Multiple eval files
geval check -c contract.yaml -e results1.json results2.json

# With baseline comparison
geval check -c contract.yaml -e results.csv -b baseline.json

# JSON output for CI
geval check -c contract.yaml -e results.csv --json
```

### `geval diff`

Compare eval results between two runs.
```bash
geval diff --previous <path> --current <path> [options]
```

Options:

- `-p, --previous <path>` - Path to previous eval results (required)
- `-c, --current <path>` - Path to current eval results (required)
- `--json` - Output results as JSON
- `--no-color` - Disable colored output
Example:

```bash
geval diff --previous baseline.json --current new.json
```

### `geval explain`

Explain why a contract passed or failed, with detailed analysis.
```bash
geval explain --contract <path> --eval <paths...> [options]
```

Options:

- `-c, --contract <path>` - Path to eval contract (YAML/JSON) (required)
- `-e, --eval <paths...>` - Path(s) to eval result files (required)
- `-b, --baseline <path>` - Path to baseline eval results
- `--verbose` - Show detailed explanations
- `--no-color` - Disable colored output
Example:

```bash
geval explain -c contract.yaml -e results.json --verbose
```

### `geval validate`

Validate a contract file for syntax and semantic correctness.
```bash
geval validate <path> [options]
```

Options:

- `<path>` - Path to eval contract (YAML/JSON) (required)
- `--strict` - Enable strict validation (warnings become errors)
- `--json` - Output results as JSON
Example:

```bash
geval validate contract.yaml --strict
```

### `geval approve`

Record a human approval decision. Creates an approval artifact that can be used as a signal.
```bash
geval approve --reason <reason> [options]
```

Options:

- `-r, --reason <reason>` - Reason for approval (required)
- `-o, --output <path>` - Output file path (default: `geval-approval.json`)
- `--by <name>` - Name of approver (defaults to `$USER`)
Example:

```bash
geval approve --reason "Reviewed with product team" --by "alice"
```

### `geval reject`

Record a human rejection decision. Creates a rejection artifact.
```bash
geval reject --reason <reason> [options]
```

Options:

- `-r, --reason <reason>` - Reason for rejection (required)
- `-o, --output <path>` - Output file path (default: `geval-rejection.json`)
- `--by <name>` - Name of reviewer (defaults to `$USER`)
Example:

```bash
geval reject --reason "Customer risk too high" --by "bob"
```

## Contract Formats

Geval supports two contract formats.

### Rules-based contracts
```yaml
version: 1                  # Contract schema version (required)
name: string                # Contract name (required)
description: string         # Optional description
environment: production     # Optional: production | staging | development

sources:                    # Optional: inline CSV/JSON parsing config
  csv:
    metrics:
      - column: accuracy
        aggregate: avg
    evalName:
      fixed: my-eval

required_evals:             # Required: list of eval suites to check
  - name: eval-name
    description: string     # Optional
    rules:                  # Required: list of rules
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.85
    on_violation:           # Required: violation handler
      action: block         # block | require_approval | warn
      message: string       # Optional custom message

metadata:                   # Optional: contract metadata
  key: value                # All values must be strings
```

### Policy-based contracts

```yaml
version: 1
name: policy-contract
environment: production

policy:
  # Environment-specific policies
  environments:
    development:
      default: pass              # Default action when no rules match
    production:
      default: require_approval
      rules:
        # Eval-based rule
        - when:
            eval:
              metric: accuracy
              operator: ">="
              baseline: fixed
              threshold: 0.90
          then:
            action: pass
            reason: "Accuracy meets threshold"
        # Signal-based rule
        - when:
            signal:
              type: risk_flag
              field: level
              operator: "=="
              value: "high"
          then:
            action: block
            reason: "High risk detected"

  # Global rules (applied to all environments)
  rules:
    - when:
        signal:
          type: human_review
          field: decision
          operator: "=="
          value: "rejected"
      then:
        action: block
        reason: "Human review rejected"
```

### Aggregation methods

When parsing CSV files, you can aggregate metrics using:
| Method | Description |
|---|---|
| `avg` | Average of all values (default) |
| `sum` | Sum of all values |
| `min` | Minimum value |
| `max` | Maximum value |
| `count` | Count of non-null values |
| `p50` | 50th percentile (median) |
| `p90` | 90th percentile |
| `p95` | 95th percentile |
| `p99` | 99th percentile |
| `pass_rate` | Percentage of `"success"`/`"pass"`/`true`/`1` values |
| `fail_rate` | Percentage of `"error"`/`"fail"`/`false`/`0` values |
| `first` | First value |
| `last` | Last value |
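For example, to gate on the share of passing rows, you could aggregate a verdict-style column with `pass_rate` (the `verdict` column name here is illustrative):

```yaml
sources:
  csv:
    metrics:
      - column: verdict         # per-row values like "pass" / "fail" (illustrative)
        aggregate: pass_rate
        as: verdict_pass_rate
```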
### Comparison operators

| Operator | Description |
|---|---|
| `==` | Equal to |
| `!=` | Not equal to |
| `<` | Less than |
| `<=` | Less than or equal to |
| `>` | Greater than |
| `>=` | Greater than or equal to |
### Baseline types

| Type | Description | Required Fields |
|---|---|---|
| `fixed` | Compare against a fixed threshold | `threshold` |
| `previous` | Compare against the previous run | `maxDelta` (optional) |
| `main` | Compare against the main branch baseline | `maxDelta` (optional) |
### Source configuration

CSV source config:

```yaml
sources:
  csv:
    metrics:
      - column: accuracy       # Column name
        aggregate: avg         # Aggregation method
        as: accuracy_score     # Optional: rename metric
    evalName:
      fixed: my-eval           # Fixed eval name
      # OR
      # column: eval_name      # Extract from column
    runId:
      fixed: run-123           # Fixed run ID
      # OR
      # column: run_id         # Extract from column
```

JSON source config:
```yaml
sources:
  json:
    metrics:
      - column: accuracy
        aggregate: first
    evalName:
      fixed: my-eval
```

Note: JSON files in normalized format (with `evalName`, `runId`, and `metrics` at the top level) are auto-detected and don't require source config.
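As a rough sketch, a normalized JSON file in that shape might look like the following; everything beyond the `evalName`, `runId`, and `metrics` keys named above (including the metrics being a flat name-to-number map) is an assumption:

```json
{
  "evalName": "quality-metrics",
  "runId": "run-123",
  "metrics": { "accuracy": 0.91, "latency_p95": 150 }
}
```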
## Signals

Signals are generic inputs to the decision engine beyond eval metrics:
```json
{
  "signals": [
    {
      "id": "signal-1",
      "type": "risk_flag",
      "name": "security-risk",
      "value": { "level": "high" },
      "metadata": { "source": "security-scan" }
    },
    {
      "id": "signal-2",
      "type": "human_review",
      "name": "approval",
      "value": { "decision": "approved", "by": "alice" }
    }
  ]
}
```

Signal types:
- `eval` - References eval result artifacts
- `human_review` - Human approval/rejection decisions
- `risk_flag` - Risk level indicators (low/medium/high)
- `external_reference` - External URLs or references
Using signals:

```bash
geval check --contract contract.yaml --eval results.csv --signals signals.json
```

## CI/CD Integration

### GitHub Actions

```yaml
name: Eval Check
on: [pull_request]

jobs:
  eval-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "18"

      - name: Run evals
        run: npm run evals # Your eval command

      - name: Install Geval
        run: npm install -g @geval-labs/cli

      - name: Check eval results
        run: |
          geval check \
            --contract contract.yaml \
            --eval eval-results.csv
```

### GitLab CI

```yaml
eval-check:
  image: node:18
  script:
    - npm install -g @geval-labs/cli
    - npm run evals
    - geval check --contract contract.yaml --eval eval-results.csv
```

## Exit Codes

| Code | Status | Description |
|---|---|---|
| `0` | PASS | All evals passed contract requirements |
| `1` | BLOCK | Contract violated, release blocked |
| `2` | REQUIRES_APPROVAL | Approval needed to proceed |
| `3` | ERROR | Execution error |
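Because each status maps to a distinct exit code, a CI step can branch on the result; a minimal shell sketch:

```bash
# Branch on the documented exit codes (the || capture keeps 'set -e' shells alive)
status=0
geval check --contract contract.yaml --eval results.csv || status=$?
if [ "$status" -eq 2 ]; then
  # REQUIRES_APPROVAL: surface the soft gate instead of failing opaquely
  echo "Release needs explicit human approval (see 'geval approve')."
fi
exit "$status"
```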
## Programmatic Use

Install the core library:

```bash
npm install @geval-labs/core
```

```ts
import {
  parseContractFromYaml,
  parseEvalFile,
  evaluate,
  formatDecision,
  type Decision,
} from "@geval-labs/core";
import { readFileSync } from "fs";

// Load contract
const contractYaml = readFileSync("contract.yaml", "utf-8");
const contract = parseContractFromYaml(contractYaml);

// Parse eval results
const csvContent = readFileSync("results.csv", "utf-8");
const evalResult = parseEvalFile(csvContent, "results.csv", contract);

// Evaluate
const decision = evaluate({
  contract,
  evalResults: [evalResult],
  baselines: {},
});

// Check result
console.log(decision.status); // "PASS" | "BLOCK" | "REQUIRES_APPROVAL"

// Format output
console.log(formatDecision(decision, { colors: true, verbose: true }));
```

### Baseline comparison

```ts
import { evaluate, type BaselineData } from "@geval-labs/core";
const decision = evaluate({
  contract,
  evalResults: [currentResult],
  baselines: {
    "my-eval": {
      type: "previous",
      metrics: baselineResult.metrics,
      source: { runId: baselineResult.runId },
    },
  },
});
```

### Policy contracts and signals

```ts
import {
  parseContractFromYaml,
  parseEvalFile,
  parseSignals,
  evaluate,
  createDecisionRecord,
} from "@geval-labs/core";
import * as fs from "fs"; // needed for fs.readFileSync below

// Parse contract with policy
// (contractYaml and csvContent are loaded as in the first example above)
const contract = parseContractFromYaml(contractYaml);

// Parse eval results
const evalResult = parseEvalFile(csvContent, "results.csv", contract);

// Parse signals
const signalsData = JSON.parse(fs.readFileSync("signals.json", "utf-8"));
const { signals } = parseSignals(signalsData);

// Evaluate with signals
const decision = evaluate({
  contract,
  evalResults: [evalResult],
  baselines: {},
  signals,
  environment: "production",
});

// Create decision record
const record = createDecisionRecord({
  decision,
  environment: "production",
  contract,
  evalResults: [evalResult],
  signals,
});
```

Core functions:
- `parseContract(data)` - Parse contract from object
- `parseContractFromYaml(yaml)` - Parse contract from YAML string
- `validateContract(contract)` - Validate contract semantics
- `evaluate(input)` - Evaluate contract against results
- `parseEvalFile(content, filePath, contract)` - Parse eval file (CSV/JSON/JSONL)
- `parseEvalResult(data)` - Parse normalized eval result
- `diffEvalResults(previous, current)` - Compare eval results
- `diffContracts(previous, current)` - Compare contracts
- `formatDecision(decision, options)` - Format decision output
Adapters:

- `GenericAdapter` - Generic JSON adapter
- `PromptfooAdapter` - Promptfoo format adapter
- `LangSmithAdapter` - LangSmith format adapter
- `OpenEvalsAdapter` - OpenEvals format adapter
- `detectAdapter(data)` - Auto-detect an adapter
- `parseWithAdapter(data, adapterName)` - Parse with a specific adapter
Types:

- `EvalContract` - Contract type
- `Decision` - Evaluation decision
- `Violation` - Rule violation
- `NormalizedEvalResult` - Normalized eval result
- `BaselineData` - Baseline data structure
- `Signal` - Signal type
- `SignalCollection` - Collection of signals
- `HumanDecision` - Human approval/rejection decision
- `DecisionRecord` - Decision record with hashes
- `Policy` - Policy-based contract structure
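As a small illustration, the `status` field shown in the first `@geval-labs/core` snippet above is enough to drive a custom gate; any field beyond `Decision.status` would be an assumption here:

```ts
import type { Decision } from "@geval-labs/core";

// Minimal gate helper: only a clean PASS ships without further human action.
function isReleasable(decision: Decision): boolean {
  return decision.status === "PASS";
}
```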
## Examples

We provide comprehensive examples in the `examples/` directory:
- Example 1: Performance Monitoring - API performance metrics (latency, throughput, error rates)
- Example 2: Safety & Compliance - AI safety metrics (toxicity, bias, PII leakage)
- Example 3: Multi-Eval Comparison - Comprehensive evaluation across multiple test suites
- Complete Walkthrough - Full workflow demonstration
Run all examples:

```bash
cd examples
npm install
npx tsx run-all-examples.ts
```

## Packages

| Package | Description |
|---|---|
| `@geval-labs/cli` | Command-line interface |
| `@geval-labs/core` | Core library for programmatic use |
## Development

Requirements:

- Node.js >= 18.0.0
- npm >= 9.0.0
```bash
# Clone repository
git clone https://github.com/geval-labs/geval.git
cd geval

# Install dependencies
npm install

# Build packages
npm run build

# Run tests
npm test

# Run linting
npm run lint
```

Project structure:

```text
packages/
├── core/ # Core library: contract parsing, evaluation engine, adapters
└── cli/ # Command-line interface
examples/ # Complete working examples
├── example-1-performance/
├── example-2-safety/
├── example-3-multi-eval/
└── complete-walkthrough/
```
## Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

MIT © Geval Contributors
Website • GitHub • npm • Documentation