feat(serve): expose env server / worker stats on /metrics Prometheus endpoint#1415
Open
mvanhorn wants to merge 1 commit into
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit df7e644.
    log_level=get_log_level(config.verbose),
    log_dir=log_dir,
    console_logging=config.disable_tui,
    metrics_port=config.metrics_port,
OpenEnv start_server rejects metrics_port
High Severity
run_evaluation always passes metrics_port into start_server, but OpenEnvEnv.start_server does not declare that keyword and does not forward it to Environment.start_server. Evaluations that use OpenEnvEnv with the env server enabled raise TypeError on startup, even when metrics are disabled.
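One way to address the report above is for the override to declare the new keyword and forward it. The following is a minimal sketch, not the repository's code: the two classes are simplified stand-ins, the real `start_server` signatures take more parameters, and the returned dict exists only to make the forwarding visible.

```python
from __future__ import annotations


class Environment:
    """Simplified stand-in for the base class (assumed shape, not the real API)."""

    def start_server(
        self, *, log_level: str = "INFO", metrics_port: int | None = None
    ) -> dict:
        # The real method would spawn the env server process;
        # this stand-in just echoes what it received.
        return {"log_level": log_level, "metrics_port": metrics_port}


class OpenEnvEnv(Environment):
    def start_server(
        self, *, log_level: str = "INFO", metrics_port: int | None = None
    ) -> dict:
        # Declare the keyword and pass it on, so a caller that always supplies
        # metrics_port (even as None) does not raise TypeError.
        return super().start_server(log_level=log_level, metrics_port=metrics_port)
```

With this shape, a call such as `OpenEnvEnv().start_server(metrics_port=None)` succeeds whether or not metrics are enabled.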


Summary
`verifiers/serve/server/metrics.py` (new)

`verifiers/serve/server/env_router.py`

Expose the live snapshot via a public `stats` property so `MetricsServer` can read it without poking private state. Currently the router builds an `EnvRouterStats` only inside `log_stats()`; refactor it to a `@property` (no functional change to the periodic logger; `log_stats()` calls `self.stats`).

`verifiers/serve/server/env_server.py`

Add `metrics_port: int | None = None` to `EnvServer.__init__`. In `run_server()`, if `metrics_port` is not `None`, instantiate the metrics server and `await metrics_server.start()` after the router boots, and `await metrics_server.close()` during shutdown alongside the existing `await self.router.close()`.

`verifiers/scripts/serve.py` (if it exists; otherwise wherever the env-server CLI lives; see `verifiers/cli/commands/`)

Add a `--metrics-port` argparse flag plumbed through to `EnvServer(metrics_port=...)`. Default `None`.

Why this matters

Issue #1188 (filed by @mikasenghaas, MEMBER) asks for "Prometheus metrics logging from env server/worker." The body is empty; the title carries the full ask. The env server already maintains rich per-worker stats (`EnvWorkerStats`: `worker_id`, `timestamp`, `active_tasks`, event-loop lag) and aggregate router stats (`EnvRouterStats`), and the router logs them on a `stats_log_interval` (default 10s). What is missing is a scrape-friendly transport: a Prometheus-format text endpoint that operators can wire into existing monitoring infra. The ZMQ ROUTER transport is the only client-facing surface today, which is opaque to standard exporters.

Acceptance:

- `--metrics-port` flag (and matching `metrics_port` config) on the env server CLI exposes a Prometheus text endpoint on `http://0.0.0.0:<port>/metrics`.
- `EnvRouterStats` + `EnvWorkerStats`: `verifiers_env_active_tasks`, `verifiers_env_workers_total`, `verifiers_env_worker_active_tasks{worker_id="N"}`, `verifiers_env_loop_lag_seconds{worker_id="N",quantile="p50|p95|p99"}`, plus a `verifiers_env_server_info{env_id=...,version=...}` gauge.
- No behavior change when `--metrics-port` is omitted (no HTTP server started).

Testing

`tests/test_metrics_server.py` (new)

Run with `uv run pytest tests/test_metrics_server.py -v`.

Fixes #1188
AI was used for assistance.
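The router and server changes described in the summary can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: the `EnvRouterStats` fields are made up, `MetricsServer` is a stand-in that records lifecycle calls instead of binding a socket, and `run_server()` returns its event log purely so the ordering is observable.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass
class EnvRouterStats:
    # Illustrative fields only; the real dataclass is richer.
    active_tasks: int
    workers_total: int


class EnvRouter:
    def __init__(self) -> None:
        self._active_tasks = 0
        self._workers_total = 2

    @property
    def stats(self) -> EnvRouterStats:
        # Fresh snapshot per access, so readers never touch private state.
        return EnvRouterStats(self._active_tasks, self._workers_total)

    def log_stats(self) -> str:
        # The periodic logger reuses the same snapshot the endpoint reads.
        s = self.stats
        return f"active_tasks={s.active_tasks} workers={s.workers_total}"


class MetricsServer:
    """Stand-in that records lifecycle calls instead of binding a socket."""

    def __init__(self, router: EnvRouter, port: int, events: list[str]) -> None:
        self.router, self.port, self.events = router, port, events

    async def start(self) -> None:
        self.events.append(f"metrics start :{self.port}")

    async def close(self) -> None:
        self.events.append("metrics close")


class EnvServer:
    def __init__(self, metrics_port: int | None = None) -> None:
        self.router = EnvRouter()
        self.metrics_port = metrics_port
        self.events: list[str] = []

    async def run_server(self) -> list[str]:
        metrics_server = None
        if self.metrics_port is not None:
            # Start only after the router exists, and only when a port is set.
            metrics_server = MetricsServer(self.router, self.metrics_port, self.events)
            await metrics_server.start()
        try:
            self.events.append(self.router.log_stats())
        finally:
            # Shut the metrics server down even if serving raised.
            if metrics_server is not None:
                await metrics_server.close()
        return self.events
```

Leaving `metrics_port` at `None` skips the metrics server entirely, which matches the stated goal of no behavior change by default.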
Note
Medium Risk
Adds a new network-facing HTTP listener and threads it through env server startup/shutdown and eval configuration, so misconfiguration (port conflicts/exposure) or handler bugs could impact availability, though default behavior remains unchanged when disabled.
Overview
Adds an optional asyncio HTTP `MetricsServer` (`verifiers/serve/server/metrics.py`) that serves Prometheus text metrics at `/metrics`, including env identity labels, aggregate/per-worker active tasks, and router/worker event-loop lag quantiles.

Plumbs `metrics_port` through env server lifecycle: `Environment.start_server(...)` forwards the port to the server process, `EnvServer` conditionally starts/stops the metrics server, and `EnvRouter` now exposes a `stats` snapshot property consumed by the endpoint (and reused by existing periodic logging).

Extends evaluation configuration/CLI to accept `--metrics-port` and persist it in `EvalConfig`, and updates loop-lag stats to compute/report `p95`; adds unit tests validating Prometheus rendering and HTTP behavior (200 on `/metrics`, 404 otherwise).

Reviewed by Cursor Bugbot for commit df7e644.
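The HTTP behavior mentioned above (200 on `/metrics`, 404 otherwise) could look roughly like this with plain `asyncio` streams. It is a sketch under assumptions rather than the PR's `MetricsServer`: the handler name, the hard-coded metric body, and the `demo()` helper are hypothetical, and it binds to loopback on an ephemeral port for the demonstration.

```python
from __future__ import annotations

import asyncio


async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    request_line = await reader.readline()
    # Consume the request headers up to the blank line (or EOF).
    while (await reader.readline()) not in (b"\r\n", b"\n", b""):
        pass
    parts = request_line.decode("latin-1").split()
    path = parts[1] if len(parts) >= 2 else ""
    if path == "/metrics":
        status, body = "200 OK", b"verifiers_env_workers_total 2\n"
    else:
        status, body = "404 Not Found", b"not found\n"
    head = (
        f"HTTP/1.1 {status}\r\n"
        "Content-Type: text/plain; version=0.0.4; charset=utf-8\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n\r\n"
    )
    writer.write(head.encode("latin-1") + body)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def fetch_status(port: int, path: str) -> int:
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(f"GET {path} HTTP/1.1\r\nHost: localhost\r\n\r\n".encode())
    await writer.drain()
    response = await reader.read()  # server closes the connection, so read to EOF
    writer.close()
    await writer.wait_closed()
    return int(response.split()[1])


async def demo() -> tuple[int, int]:
    # Port 0 asks the OS for a free port, avoiding conflicts in the demo.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        return await fetch_status(port, "/metrics"), await fetch_status(port, "/nope")
```

Calling `asyncio.run(demo())` exercises both paths against a throwaway listener.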
Note
Expose env server and worker stats on a Prometheus `/metrics` endpoint

- Adds `metrics.py` with `render_prometheus_text` and `MetricsServer`: an asyncio HTTP server that serves Prometheus text at `/metrics` (404 for all other paths), including active tasks, worker count, and event loop lag quantiles (p50, p95, p99).
- Adds a `stats` property to `EnvRouter` that returns a structured `EnvRouterStats` snapshot used by both the metrics endpoint and existing log output.
- Plumbs `metrics_port` through `EvalConfig`, the eval CLI (`--metrics-port`), TOML config loading, `Environment.start_server`, and `EnvServer` so the endpoint is started automatically when the port is configured.
- Adds `p95` to `EventLoopLagStats` to support the new quantile reporting.

Macroscope summarized df7e644.
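For a sense of what the rendered output might look like, here is a minimal sketch of a Prometheus text renderer. The metric names come from the PR description; the `WorkerSnapshot` dataclass, the function signature, and the choice to skip missing quantiles are assumptions for illustration and may differ from the actual `render_prometheus_text`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WorkerSnapshot:
    # Hypothetical stand-in for EnvWorkerStats.
    worker_id: int
    active_tasks: int
    lag: dict[str, float] = field(default_factory=dict)  # quantile label -> seconds


def render_prometheus_text(workers: list[WorkerSnapshot]) -> str:
    lines = [
        "# TYPE verifiers_env_active_tasks gauge",
        f"verifiers_env_active_tasks {sum(w.active_tasks for w in workers)}",
        "# TYPE verifiers_env_workers_total gauge",
        f"verifiers_env_workers_total {len(workers)}",
        "# TYPE verifiers_env_worker_active_tasks gauge",
    ]
    for w in workers:
        lines.append(
            f'verifiers_env_worker_active_tasks{{worker_id="{w.worker_id}"}} {w.active_tasks}'
        )
    lines.append("# TYPE verifiers_env_loop_lag_seconds gauge")
    for w in workers:
        for q in ("p50", "p95", "p99"):
            if q in w.lag:  # emit only the quantiles that were measured
                lines.append(
                    f'verifiers_env_loop_lag_seconds{{worker_id="{w.worker_id}",quantile="{q}"}} {w.lag[q]}'
                )
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"
```

Each sample is one line of `name{labels} value`, so the output can be checked with simple string assertions, in the spirit of the unit tests the PR adds.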