# Geval

**Eval-driven release decisions for AI systems**
Quick Start • Features • Documentation • Examples • Contributing
Geval is a core library for eval-driven release decisions in AI systems.
Modern AI teams run evals, inspect traces, and review results — but release decisions are still implicit, manual, and fragmented across CI scripts, Slack threads, and human judgment.
Geval exists to make those decisions explicit, reviewable, and enforceable.
It consumes evaluation results and other decision signals you already trust, applies decision contracts, and produces a deterministic outcome:
- ✅ allowed to ship
- ⚠️ allowed only with explicit human approval
- ❌ blocked from release
CI/CD is where enforcement happens — but the real value of Geval is creating a single, auditable decision layer between evals and production.
Eval Signals + Human Context → Geval Contract → Release Decision Record
Teams adopt Geval when:
- evals are directional, not absolute
- humans are always in the loop
- "half-baked" AI changes keep slipping through
- retros keep asking "why did we ship this?"
- custom CI scripts start to rot
Geval does not try to improve eval quality.
It removes decision ambiguity.
Geval is designed for:
- AI platform and infra engineers
- Teams with real production blast radius
- Systems where AI behavior affects users, safety, or revenue
- Teams tired of informal release decisions
Geval is not a fit for:
- exploration-only or research workflows
- teams happy with "we'll monitor it"
- users looking for dashboards or insights
- anyone expecting fully automated truth from evals
Geval is:
- ✅ Eval-agnostic – Consumes outputs from Promptfoo, LangSmith exports, OpenEvals, or custom tooling
- ✅ Signal-driven – Works with eval metrics, human review decisions, risk flags, and external references
- ✅ Contract-based – Encodes release intent and tolerance explicitly as code
- ✅ Decision-centric – Produces PASS / REQUIRE_APPROVAL / BLOCK outcomes with clear reasons
- ✅ CI-native (but not CI-limited) – Designed for CI enforcement, applicable anywhere release decisions are made
- ✅ Open-source core – All decision logic and contracts are inspectable and deterministic
Geval is not:
- ❌ An eval runner – it never executes evals or workflows
- ❌ An observability or tracing tool – it does not replace Phoenix or LangSmith
- ❌ A monitoring system – it does not watch production traffic
- ❌ A testing framework – it validates existing results, not tests
- ❌ A quality oracle – it does not define correctness or universal benchmarks
Those tools answer:
"What happened?"
Geval answers:
"Given what we knew at the time, was this allowed to ship?"
## Quick Start

```bash
# Install CLI globally
npm install -g @geval-labs/cli

# Or install as a dev dependency
npm install --save-dev @geval-labs/cli
```

Create a `contract.yaml` file:
```yaml
version: 1
name: quality-gate
description: Quality requirements for production deployment
environment: production

# Define how to parse CSV files
sources:
  csv:
    metrics:
      - column: accuracy
        aggregate: avg
      - column: latency_ms
        aggregate: p95
        as: latency_p95
      - column: error_rate
        aggregate: avg
    evalName:
      fixed: quality-metrics

# Define quality requirements
required_evals:
  - name: quality-metrics
    rules:
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.85
        description: Accuracy must be at least 85%
      - metric: latency_p95
        operator: "<="
        baseline: fixed
        threshold: 200
        description: P95 latency must be under 200ms
      - metric: error_rate
        operator: "<="
        baseline: fixed
        threshold: 0.01
        description: Error rate must be under 1%
    on_violation:
      action: block
      message: "Quality metrics did not meet production requirements"
```
Check the results against the contract:

```bash
# Check eval results against contract
geval check --contract contract.yaml --eval results.csv

# With baseline comparison
geval check --contract contract.yaml --eval results.csv --baseline baseline.json

# JSON output for CI/CD
geval check --contract contract.yaml --eval results.csv --json
```

**Pass:**
```text
✓ PASS
Contract: quality-gate
Version: 1

All 1 eval(s) passed contract requirements
```
**Block:**

```text
✗ BLOCK
Contract: quality-gate
Version: 1

Blocked: 2 violation(s) in 1 eval

Violations

1. quality-metrics → accuracy
   accuracy = 0.72, expected >= 0.85
   Actual: 0.72 | Baseline: 0.85

2. quality-metrics → latency_p95
   latency_p95 = 250, expected <= 200
   Actual: 250 | Baseline: 200
```
Geval supports multiple eval result formats:
- CSV - Raw data exports from any tool
- JSON - Normalized eval results or tool-specific formats
- JSONL - Line-delimited JSON files
Define CSV parsing directly in your contract - no separate config files needed:

```yaml
sources:
  csv:
    metrics:
      - column: accuracy
        aggregate: avg
      - column: latency
        aggregate: p95
        as: latency_p95
    evalName:
      fixed: my-eval
```

Compare against:
- Fixed thresholds - Absolute quality gates
- Previous runs - Detect regressions (see the sketch after this list)
- Main branch - Compare against the production baseline
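For instance, a regression gate against a previous run could look like the sketch below. This assumes the rule-level field is spelled `maxDelta`, as listed in the baseline types table further down; pair it with `geval check ... --baseline baseline.json`:

```yaml
required_evals:
  - name: quality-metrics
    rules:
      - metric: accuracy
        operator: ">="
        baseline: previous   # compare against the previous run
        maxDelta: 0.02       # assumed key name: allow at most a 0.02 drop
    on_violation:
      action: require_approval
```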
Built-in adapters for popular eval tools:
- Promptfoo - Automatic format detection
- LangSmith - CSV export support
- OpenEvals - JSON format support
- Generic - Custom JSON formats
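Adapter detection is automatic where the format is recognizable; the `--adapter` flag on `geval check` (documented below) forces a specific one:

```bash
# Force the Promptfoo adapter instead of relying on auto-detection
geval check --contract contract.yaml --eval promptfoo-results.json --adapter promptfoo
```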
Geval now supports policy-based contracts that can evaluate signals beyond just eval metrics:
- Signal types: `eval`, `human_review`, `risk_flag`, `external_reference`
- Signal conditions: match signals by type, name, or value
- Environment-specific rules: different policies for dev/staging/prod
- Human decisions: record approvals/rejections as signals
- Exit codes: `0` PASS · `1` BLOCK · `2` REQUIRES_APPROVAL · `3` ERROR
- JSON output: `--json` flag for programmatic use
- Multiple files: support for multiple eval result files
- Decision records: automatic generation of decision artifacts with hashes
## CLI Commands

### `geval check`

Evaluate contracts against eval results and enforce decisions.

```bash
geval check --contract <path> --eval <paths...> [options]
```

Options:

- `-c, --contract <path>` - Path to eval contract (YAML/JSON) (required)
- `-e, --eval <paths...>` - Path(s) to eval result files (CSV/JSON/JSONL) (required)
- `-b, --baseline <path>` - Path to baseline eval results for comparison
- `--adapter <name>` - Force a specific adapter (promptfoo, langsmith, openevals, generic)
- `--json` - Output results as JSON
- `--no-color` - Disable colored output
- `--verbose` - Show detailed output
Examples:

```bash
# Basic check
geval check -c contract.yaml -e results.csv

# Multiple eval files
geval check -c contract.yaml -e results1.json results2.json

# With baseline comparison
geval check -c contract.yaml -e results.csv -b baseline.json

# JSON output for CI
geval check -c contract.yaml -e results.csv --json
```

### `geval diff`

Compare eval results between two runs.
```bash
geval diff --previous <path> --current <path> [options]
```

Options:

- `-p, --previous <path>` - Path to previous eval results (required)
- `-c, --current <path>` - Path to current eval results (required)
- `--json` - Output results as JSON
- `--no-color` - Disable colored output
Example:

```bash
geval diff --previous baseline.json --current new.json
```

### `geval explain`

Explain why a contract passed or failed, with detailed analysis.
```bash
geval explain --contract <path> --eval <paths...> [options]
```

Options:

- `-c, --contract <path>` - Path to eval contract (YAML/JSON) (required)
- `-e, --eval <paths...>` - Path(s) to eval result files (required)
- `-b, --baseline <path>` - Path to baseline eval results
- `--verbose` - Show detailed explanations
- `--no-color` - Disable colored output
Example:

```bash
geval explain -c contract.yaml -e results.json --verbose
```

### `geval validate`

Validate a contract file for syntax and semantic correctness.
```bash
geval validate <path> [options]
```

Options:

- `<path>` - Path to eval contract (YAML/JSON) (required)
- `--strict` - Enable strict validation (warnings become errors)
- `--json` - Output results as JSON
Example:

```bash
geval validate contract.yaml --strict
```

### `geval approve`

Record a human approval decision. Creates an approval artifact that can be used as a signal.
```bash
geval approve --reason <reason> [options]
```

Options:

- `-r, --reason <reason>` - Reason for approval (required)
- `-o, --output <path>` - Output file path (default: `geval-approval.json`)
- `--by <name>` - Name of approver (defaults to `$USER`)
Example:

```bash
geval approve --reason "Reviewed with product team" --by "alice"
```

### `geval reject`

Record a human rejection decision. Creates a rejection artifact.
```bash
geval reject --reason <reason> [options]
```

Options:

- `-r, --reason <reason>` - Reason for rejection (required)
- `-o, --output <path>` - Output file path (default: `geval-rejection.json`)
- `--by <name>` - Name of reviewer (defaults to `$USER`)
Example:

```bash
geval reject --reason "Customer risk too high" --by "bob"
```

## Contract Formats

Geval supports two contract formats.

### Rules-based contracts
```yaml
version: 1                  # Contract schema version (required)
name: string                # Contract name (required)
description: string         # Optional description
environment: production     # Optional: production | staging | development

sources:                    # Optional: inline CSV/JSON parsing config
  csv:
    metrics:
      - column: accuracy
        aggregate: avg
    evalName:
      fixed: my-eval

required_evals:             # Required: list of eval suites to check
  - name: eval-name
    description: string     # Optional
    rules:                  # Required: list of rules
      - metric: accuracy
        operator: ">="
        baseline: fixed
        threshold: 0.85
    on_violation:           # Required: violation handler
      action: block         # block | require_approval | warn
      message: string       # Optional custom message

metadata:                   # Optional: contract metadata
  key: value                # All values must be strings
```

### Policy-based contracts

```yaml
version: 1
name: policy-contract
environment: production

policy:
  # Environment-specific policies
  environments:
    development:
      default: pass              # Default action when no rules match
    production:
      default: require_approval
      rules:
        # Eval-based rule
        - when:
            eval:
              metric: accuracy
              operator: ">="
              baseline: fixed
              threshold: 0.90
          then:
            action: pass
            reason: "Accuracy meets threshold"
        # Signal-based rule
        - when:
            signal:
              type: risk_flag
              field: level
              operator: "=="
              value: "high"
          then:
            action: block
            reason: "High risk detected"

  # Global rules (applied to all environments)
  rules:
    - when:
        signal:
          type: human_review
          field: decision
          operator: "=="
          value: "rejected"
      then:
        action: block
        reason: "Human review rejected"
```

### Aggregation methods

When parsing CSV files, you can aggregate metrics using:
| Method | Description |
|---|---|
| `avg` | Average of all values (default) |
| `sum` | Sum of all values |
| `min` | Minimum value |
| `max` | Maximum value |
| `count` | Count of non-null values |
| `p50` | 50th percentile (median) |
| `p90` | 90th percentile |
| `p95` | 95th percentile |
| `p99` | 99th percentile |
| `pass_rate` | Percentage of `"success"`/`"pass"`/`true`/`1` values |
| `fail_rate` | Percentage of `"error"`/`"fail"`/`false`/`0` values |
| `first` | First value |
| `last` | Last value |
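For example, to gate on the share of passing rows, you could aggregate a verdict-style column with `pass_rate` (the `verdict` column name here is illustrative):

```yaml
sources:
  csv:
    metrics:
      - column: verdict         # per-row values like "pass" / "fail" (illustrative)
        aggregate: pass_rate
        as: verdict_pass_rate
```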
### Comparison operators

| Operator | Description |
|---|---|
| `==` | Equal to |
| `!=` | Not equal to |
| `<` | Less than |
| `<=` | Less than or equal to |
| `>` | Greater than |
| `>=` | Greater than or equal to |
### Baseline types

| Type | Description | Required Fields |
|---|---|---|
| `fixed` | Compare against a fixed threshold | `threshold` |
| `previous` | Compare against the previous run | `maxDelta` (optional) |
| `main` | Compare against the main branch baseline | `maxDelta` (optional) |
### Source configuration

CSV source config:

```yaml
sources:
  csv:
    metrics:
      - column: accuracy       # Column name
        aggregate: avg         # Aggregation method
        as: accuracy_score     # Optional: rename metric
    evalName:
      fixed: my-eval           # Fixed eval name
      # OR
      # column: eval_name      # Extract from column
    runId:
      fixed: run-123           # Fixed run ID
      # OR
      # column: run_id         # Extract from column
```

JSON source config:
```yaml
sources:
  json:
    metrics:
      - column: accuracy
        aggregate: first
    evalName:
      fixed: my-eval
```

Note: JSON files in normalized format (with `evalName`, `runId`, and `metrics` at the top level) are auto-detected and don't require source config.
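As a rough sketch, a normalized JSON file in that shape might look like the following; everything beyond the `evalName`, `runId`, and `metrics` keys named above (including the metrics being a flat name-to-number map) is an assumption:

```json
{
  "evalName": "quality-metrics",
  "runId": "run-123",
  "metrics": { "accuracy": 0.91, "latency_p95": 150 }
}
```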
## Signals

Signals are generic inputs to the decision engine beyond eval metrics:
```json
{
  "signals": [
    {
      "id": "signal-1",
      "type": "risk_flag",
      "name": "security-risk",
      "value": { "level": "high" },
      "metadata": { "source": "security-scan" }
    },
    {
      "id": "signal-2",
      "type": "human_review",
      "name": "approval",
      "value": { "decision": "approved", "by": "alice" }
    }
  ]
}
```

Signal types:
- `eval` - References eval result artifacts
- `human_review` - Human approval/rejection decisions
- `risk_flag` - Risk level indicators (low/medium/high)
- `external_reference` - External URLs or references
Using signals:

```bash
geval check --contract contract.yaml --eval results.csv --signals signals.json
```

## CI/CD Integration

### GitHub Actions

```yaml
name: Eval Check
on: [pull_request]

jobs:
  eval-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "18"

      - name: Run evals
        run: npm run evals # Your eval command

      - name: Install Geval
        run: npm install -g @geval-labs/cli

      - name: Check eval results
        run: |
          geval check \
            --contract contract.yaml \
            --eval eval-results.csv
```

### GitLab CI

```yaml
eval-check:
  image: node:18
  script:
    - npm install -g @geval-labs/cli
    - npm run evals
    - geval check --contract contract.yaml --eval eval-results.csv
```

## Exit Codes

| Code | Status | Description |
|---|---|---|
| `0` | PASS | All evals passed contract requirements |
| `1` | BLOCK | Contract violated, release blocked |
| `2` | REQUIRES_APPROVAL | Approval needed to proceed |
| `3` | ERROR | Execution error |
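Because each status maps to a distinct exit code, a CI step can branch on the result; a minimal shell sketch:

```bash
# Branch on the documented exit codes (the || capture keeps 'set -e' shells alive)
status=0
geval check --contract contract.yaml --eval results.csv || status=$?
if [ "$status" -eq 2 ]; then
  # REQUIRES_APPROVAL: surface the soft gate instead of failing opaquely
  echo "Release needs explicit human approval (see 'geval approve')."
fi
exit "$status"
```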
## Programmatic Use

Install the core library:

```bash
npm install @geval-labs/core
```

```ts
import {
  parseContractFromYaml,
  parseEvalFile,
  evaluate,
  formatDecision,
  type Decision,
} from "@geval-labs/core";
import { readFileSync } from "fs";

// Load contract
const contractYaml = readFileSync("contract.yaml", "utf-8");
const contract = parseContractFromYaml(contractYaml);

// Parse eval results
const csvContent = readFileSync("results.csv", "utf-8");
const evalResult = parseEvalFile(csvContent, "results.csv", contract);

// Evaluate
const decision = evaluate({
  contract,
  evalResults: [evalResult],
  baselines: {},
});

// Check result
console.log(decision.status); // "PASS" | "BLOCK" | "REQUIRES_APPROVAL"

// Format output
console.log(formatDecision(decision, { colors: true, verbose: true }));
```

### Baseline comparison

```ts
import { evaluate, type BaselineData } from "@geval-labs/core";
const decision = evaluate({
  contract,
  evalResults: [currentResult],
  baselines: {
    "my-eval": {
      type: "previous",
      metrics: baselineResult.metrics,
      source: { runId: baselineResult.runId },
    },
  },
});
```

### Policy contracts and signals

```ts
import {
  parseContractFromYaml,
  parseEvalFile,
  parseSignals,
  evaluate,
  createDecisionRecord,
} from "@geval-labs/core";
import * as fs from "fs"; // needed for fs.readFileSync below

// Parse contract with policy
// (contractYaml and csvContent are loaded as in the first example above)
const contract = parseContractFromYaml(contractYaml);

// Parse eval results
const evalResult = parseEvalFile(csvContent, "results.csv", contract);

// Parse signals
const signalsData = JSON.parse(fs.readFileSync("signals.json", "utf-8"));
const { signals } = parseSignals(signalsData);

// Evaluate with signals
const decision = evaluate({
  contract,
  evalResults: [evalResult],
  baselines: {},
  signals,
  environment: "production",
});

// Create decision record
const record = createDecisionRecord({
  decision,
  environment: "production",
  contract,
  evalResults: [evalResult],
  signals,
});
```

Core functions:
- `parseContract(data)` - Parse contract from object
- `parseContractFromYaml(yaml)` - Parse contract from YAML string
- `validateContract(contract)` - Validate contract semantics
- `evaluate(input)` - Evaluate contract against results
- `parseEvalFile(content, filePath, contract)` - Parse eval file (CSV/JSON/JSONL)
- `parseEvalResult(data)` - Parse normalized eval result
- `diffEvalResults(previous, current)` - Compare eval results
- `diffContracts(previous, current)` - Compare contracts
- `formatDecision(decision, options)` - Format decision output
Adapters:

- `GenericAdapter` - Generic JSON adapter
- `PromptfooAdapter` - Promptfoo format adapter
- `LangSmithAdapter` - LangSmith format adapter
- `OpenEvalsAdapter` - OpenEvals format adapter
- `detectAdapter(data)` - Auto-detect an adapter
- `parseWithAdapter(data, adapterName)` - Parse with a specific adapter
Types:

- `EvalContract` - Contract type
- `Decision` - Evaluation decision
- `Violation` - Rule violation
- `NormalizedEvalResult` - Normalized eval result
- `BaselineData` - Baseline data structure
- `Signal` - Signal type
- `SignalCollection` - Collection of signals
- `HumanDecision` - Human approval/rejection decision
- `DecisionRecord` - Decision record with hashes
- `Policy` - Policy-based contract structure
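As a small illustration, the `status` field shown in the first `@geval-labs/core` snippet above is enough to drive a custom gate; any field beyond `Decision.status` would be an assumption here:

```ts
import type { Decision } from "@geval-labs/core";

// Minimal gate helper: only a clean PASS ships without further human action.
function isReleasable(decision: Decision): boolean {
  return decision.status === "PASS";
}
```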
## Examples

We provide comprehensive examples in the `examples/` directory:
- Example 1: Performance Monitoring - API performance metrics (latency, throughput, error rates)
- Example 2: Safety & Compliance - AI safety metrics (toxicity, bias, PII leakage)
- Example 3: Multi-Eval Comparison - Comprehensive evaluation across multiple test suites
- Complete Walkthrough - Full workflow demonstration
Run all examples:

```bash
cd examples
npm install
npx tsx run-all-examples.ts
```

## Packages

| Package | Description |
|---|---|
| `@geval-labs/cli` | Command-line interface |
| `@geval-labs/core` | Core library for programmatic use |
## Development

Requirements:

- Node.js >= 18.0.0
- npm >= 9.0.0
```bash
# Clone repository
git clone https://github.com/geval-labs/geval.git
cd geval

# Install dependencies
npm install

# Build packages
npm run build

# Run tests
npm test

# Run linting
npm run lint
```

Project structure:

```text
packages/
├── core/ # Core library: contract parsing, evaluation engine, adapters
└── cli/ # Command-line interface
examples/ # Complete working examples
├── example-1-performance/
├── example-2-safety/
├── example-3-multi-eval/
└── complete-walkthrough/
```
## Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

MIT © Geval Contributors
Website • GitHub • npm • Documentation