| title | RVEDA RCM ARENA Environment Server |
|---|---|
| emoji | 🏥 |
| colorFrom | green |
| colorTo | blue |
| sdk | docker |
| pinned | false |
| app_port | 8000 |
| base_path | /web |
| tags | |

RVEDA RCM ARENA is an OpenEnv benchmark for training cautious medical coding agents under partial observability, schema rules, and policy drift.
In real revenue-cycle workflows, one-shot coding is unsafe; the agent has to reveal evidence, verify rules, adapt to changing claim requirements, and submit a grounded claim rather than guess from incomplete context.
- Live Space: anirudw/rveda-rcm-arena
- Training Notebook: train_generated_v2_grpo_launcher.ipynb
- Training Script: train_grpo_smoke.py
- Reward/Loss Plots: Training Evidence
- Blog: Blog.MD
- Round 1 Baseline: anirudw/rveda
Medical coding is not just a classification task. In real revenue-cycle workflows, an agent has to work under incomplete evidence, evolving payer rules, and claim formatting constraints. A benchmark that rewards only the final code can easily reward the wrong behavior: unsupported specificity, shortcut retrieval, and submission without enough evidence.
RVEDA RCM ARENA is built to test the opposite behavior. The agent must reveal hidden chart evidence, search the ICD candidate space, inspect code details, check policy/schema requirements, and only then submit a grounded claim. That makes the benchmark more representative of professional work than a static label lookup task.
OpenEnv is the right fit because the problem is fundamentally an interaction loop, not a single forward pass. The environment needs structured actions, structured observations, explicit reward signals, and a clean server/client boundary that judges can rerun.
- uncover hidden chart evidence from the EHR
- search/select the correct code from a large candidate space
- adapt to policy/schema drift and claim-format constraints
- submit a valid grounded claim
- Fog-of-War EHR: decisive evidence is hidden behind `QUERY_EHR` rather than visible at reset.
- Policy/schema drift: claim requirements can change mid-episode and the agent has to adapt.
- Verifier-based cautious reward design: the environment exposes structured reward metrics for correctness, grounding, schema compliance, format validity, process discipline, and drift adaptation.
Current state:
- the partial-observability and verifier mechanics are implemented
- the Colab training path is rerunnable
- the strongest current evidence is still smoke-level rather than large-scale training, but now shows a real completion-oriented gain over the scripted baseline
| Area | Current implementation |
|---|---|
| Action space | SEARCH, DETAILS, QUERY_EHR, CHECK_POLICY, VALIDATE_CLAIM_SCHEMA, REASONING_LOG, SUBMIT |
| Observation structure | patient note, search results, detailed info, ehr_map, revealed evidence, policy state, drift notice, reward metrics, error fields |
| Drift behavior | policy/schema drift can change required fields mid-episode |
| Success condition | submit the correct code with the required evidence and schema-valid workflow state |
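
The drift row above can be illustrated with a minimal sketch of schema validation before and after a mid-episode change. Everything named here is hypothetical: `ClaimSchema`, `missing_fields`, and the field names `icd10_code`, `evidence_ids`, and `encounter_type` are placeholders for illustration, not the environment's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimSchema:
    """Hypothetical claim schema: a version tag plus the fields a claim must carry."""
    version: str
    required_fields: set = field(default_factory=set)


def missing_fields(payload: dict, schema: ClaimSchema) -> list:
    """Return the sorted required fields that are absent or empty in a draft claim."""
    return sorted(name for name in schema.required_fields if not payload.get(name))


# Illustrative schemas only; the real requirements live in the environment code.
schema_v1 = ClaimSchema("v1", {"icd10_code", "evidence_ids"})
schema_v2 = ClaimSchema("v2", {"icd10_code", "evidence_ids", "encounter_type"})

draft = {"icd10_code": "E66.3", "evidence_ids": ["labs-1"]}
missing_before = missing_fields(draft, schema_v1)  # draft is complete under v1
missing_after = missing_fields(draft, schema_v2)   # drift adds a required field
print(missing_before, missing_after)
```

Under this sketch, a draft that validated before drift fails afterward, which is why an agent would need to re-check policy and re-validate rather than reuse an earlier result.
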
The current judge-facing training path is intentionally small-model first so it can be rerun in Colab instead of depending on one large-model attempt.
- Generated V2 Colab launcher: `train_generated_v2_grpo_launcher.ipynb`
- Smoke Colab launcher: `train_grpo_smoke_launcher.ipynb`
- Training runner: `train_grpo_smoke.py`
- Technical blog: `Blog.MD`
Current setup:
- model used: `Qwen/Qwen2.5-1.5B-Instruct`
- trainer: TRL GRPO with a plain-TRL fallback path
- reward wiring: live environment reward through the training bridge, not a static offline label file
- evidence level: smoke-level but real, rerunnable, artifact-producing, and now strong enough to show a meaningful baseline comparison
Recommended order:
- Sync the repo in Colab.
- Run `python -m pytest -q`.
- Run `openenv validate`.
- Start with the small-model preset before attempting larger runs.
- Confirm that the run produces saved artifacts under `artifacts/`.
Current primary smoke-run evidence:
| Run | Model | Tasks | Train steps | Mean total reward | SUBMIT count | Search-to-submission ratio | Timeout frequency |
|---|---|---|---|---|---|---|---|
| Baseline policy | Qwen/Qwen2.5-1.5B-Instruct | 4 | 0 | 0.810125 | 4 / 8 | 3.25 | 0.5 |
| Trained policy | Qwen/Qwen2.5-1.5B-Instruct | 4 | 8 | 1.31500 | 8 / 8 | 1.0 | 0.0 |
Comparison summary:
| Metric | Value |
|---|---|
| Colab GPU | Tesla T4 |
| Reward delta | +0.504875 |
| Grounding F1 proxy delta | -0.00654 |
| Interpretation | smoke-level improvement over the scripted baseline, driven mainly by higher submission completion and lower timeout rate |
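
The reward delta in the summary follows directly from the two mean rewards in the run table. A short check, using only the reported numbers (the variable names are illustrative):

```python
# Values copied from the smoke-run table above.
baseline_mean_reward = 0.810125
trained_mean_reward = 1.31500

reward_delta = trained_mean_reward - baseline_mean_reward

# SUBMIT completion as a fraction of the 8 evaluation episodes.
baseline_submit_rate = 4 / 8
trained_submit_rate = 8 / 8

print(f"reward delta: {reward_delta:+.6f}")
print(f"submit rate: {baseline_submit_rate:.2f} -> {trained_submit_rate:.2f}")
```
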
Honest interpretation:
- This is real proof that training executed, produced artifacts, and improved end-to-end policy behavior on the current smoke evaluation.
- The trained 1.5B policy now completes `8 / 8` evaluation episodes with `SUBMIT`, while the scripted baseline only completes `4 / 8` and times out on half of them.
- This is still smoke-scale evidence, not a final performance claim: grounding is still measured through a proxy, and drift/schema headline metrics are not yet surfaced in the saved artifact summary.
Available verifier-facing metrics in the current run:
- grounding F1 proxy
- search-to-submission ratio
- timeout frequency
- trained-minus-baseline verifier deltas
- trainer-reported `train_loss`: `0.0` over `8` train steps
Metrics that are still early / unavailable in the current smoke run:
- drift adaptation rate
- schema validation pass rate
Most informative plots:
Phase 3 baseline-versus-trained reward comparison for the strongest current smoke run; the trained 1.5B policy materially outperforms the scripted baseline on completion-oriented behavior in this evaluation slice.
Earlier smoke-run reward comparison from the previous successful run; this helps show how the reward picture changed across iterations rather than only within the final Phase 3 result.
Training loss from the earlier successful smoke run; included here so the Phase 3 loss curve can be read as part of an iteration sequence rather than as a standalone plot.
GRPO training loss by logged training step for the current Qwen/Qwen2.5-1.5B-Instruct Colab run.
The Phase 3 verifier-metric plot is still included in the repo at docs/plots/verifier_metrics_plot_p3.png, but it is less visually informative than the reward and loss plots because the strongest changes are already captured more clearly by submission completion and timeout behavior.
Quickstart:
- Install: `uv sync` or the Colab install cells in `train_generated_v2_grpo_launcher.ipynb`
- Local run: `python -m pytest -q`, `openenv validate`, then run the training script or notebook
- Space run: use the linked Hugging Face Space above
- Colab rerun: start from the generated training notebook
- Artifact output location: `artifacts/`
| Area | Path |
|---|---|
| Environment code | server/rveda_environment.py, server/policy_engine.py, server/reward_engine.py |
| Task generation | generate_cases.py, examples/ |
| Training script | train_grpo_smoke.py |
| Notebook | train_generated_v2_grpo_launcher.ipynb, train_grpo_smoke_launcher.ipynb |
| Artifacts / plots | artifacts/, docs/plots/ |
| Validation / utilities | check-readiness.py, validate-submission.sh |
Already proven:
- the OpenEnv environment is live and validated
- the Colab training path reruns on a `Tesla T4`
- the strongest current smoke run beats the scripted baseline on end-to-end completion and mean total reward
- reward, loss, and verifier artifacts are saved for review
Still early:
- the current result is still based on a small smoke evaluation rather than a broad curriculum
- grounding is still tracked through a proxy and regresses slightly in the strongest run
- drift adaptation rate and schema validation pass rate are not yet strong headline metrics in the saved smoke artifacts
- the current reward evidence is stronger than mere pipeline proof, but not yet large-scale learning evidence
Next steps:
- strengthen reward shaping and curriculum so the trained policy can preserve its completion gains while improving grounding quality
- expand generated tasks while keeping the Colab path rerunnable
- surface drift/schema metrics more clearly in the evaluation artifacts
- only then scale training beyond the small-model-first setup
Medical coding sits inside a much larger operational and financial surface area. In a JAMA time-driven costing study, billing and insurance-related activities were estimated at $20 to $215 per encounter and 3% to 25% of professional revenue, depending on encounter type, even in a large academic health system with a certified EHR. In Medicare Advantage, the March 2024 MedPAC report estimated that Medicare would spend 22% more for MA enrollees than comparable FFS beneficiaries in 2024, a projected $83 billion gap, with coding intensity alone projected to add about $50 billion in payments. A 2025 Health Affairs Scholar study found an enrollment-weighted mean coding inflation rate of 8.4%, with 68.1% of MA enrollees in contracts above Medicare's coding-intensity adjustment.
Those studies do not imply that a lightweight benchmark solves system-level payment integrity or administrative waste. They do support a narrower claim: diagnosis coding and billing workflow quality are financially material, and a benchmark that rewards grounded, cautious coding behavior is studying a real operational problem rather than a toy label task.
A benchmark that rewards only the final label risks training exactly the wrong behavior: hallucinating or overly aggressive agents that maximize apparent specificity without grounding. RVEDA RCM ARENA is designed to test the opposite behavior: grounded, stepwise coding decisions in which retrieval and verification are part of the task, not optional post-processing.
RVEDA RCM ARENA is designed to answer a concrete research question:
Can an LLM agent behave like a cautious medical coder, rather than a one-shot label generator?
This framing matters for benchmark design:
- It tests clinical reasoning, not only label recall.
- It tests search efficiency, because the agent must retrieve and inspect evidence before submission.
- It penalizes hallucinated or over-aggressive coding behavior by making verification part of the interaction loop.
- It supports human-in-the-loop auditing, because each step leaves an explicit interaction trace.
Established platforms such as FraudLens, Cotiviti, and Optum FWA address a different layer of the problem: post hoc detection of fraud, waste, abuse, and anomalous billing behavior across large claims datasets.
RVEDA RCM ARENA addresses a different question. It is a pre-deployment benchmark for agentic medical coding systems, designed to test whether an AI model arrives at a code through grounded clinical reasoning before deployment.
That distinction is important. Statistical anomaly detection evaluates aggregate billing behavior across populations and claims streams; RVEDA RCM ARENA evaluates the reasoning trajectory of an individual AI agent as it searches, inspects evidence, and commits to an ICD-10 code. In that sense, RVEDA RCM ARENA is complementary to enterprise auditing systems: those platforms help catch problematic claims after the fact, while RVEDA RCM ARENA is designed to test whether an autonomous coding agent should be trusted before deployment.
Each episode starts with a patient note and ends when the agent submits an ICD-10 code or exhausts the episode budget.
The action space is deliberately small and tool-like:
- `SEARCH(query)`: query the local ICD-10 index for candidate codes.
- `DETAILS(code)`: retrieve long-form code details and exclusion notes.
- `QUERY_EHR(module, query)`: reveal evidence from one hidden EHR module in the minimal Task 1.3 V2 slice.
- `CHECK_POLICY()`: reveal the active payer policy version and current claim-schema requirements.
- `VALIDATE_CLAIM_SCHEMA(payload)`: validate a draft claim against the active schema without ending the episode.
- `REASONING_LOG(payload)`: submit a grounded reasoning record that cites revealed evidence before final submission.
- `SUBMIT(code)`: finalize the coding decision and end the episode.
This setup mimics the operational logic of medical coding review: reveal hidden evidence when needed, retrieve candidates, inspect details, check the active claim rules, validate a draft claim, record grounded reasoning, then commit.
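
The workflow order described above can be sketched as a simple check over a sequence of action names. The action names come from the action space; the function `ends_with_grounded_submit` and the rule it applies are an illustration, not a statement of what the environment itself enforces.

```python
def ends_with_grounded_submit(action_names: list) -> bool:
    """True if SUBMIT is the final action and a REASONING_LOG precedes it."""
    if not action_names or action_names[-1] != "SUBMIT":
        return False
    return "REASONING_LOG" in action_names[:-1]


# A cautious trajectory in the order the paragraph above describes (7 steps).
cautious = [
    "QUERY_EHR",
    "SEARCH",
    "DETAILS",
    "CHECK_POLICY",
    "VALIDATE_CLAIM_SCHEMA",
    "REASONING_LOG",
    "SUBMIT",
]

# A one-shot trajectory that commits without revealing or verifying anything.
one_shot = ["SUBMIT"]

print(ends_with_grounded_submit(cautious), ends_with_grounded_submit(one_shot))
```
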
RVEDA RCM ARENA consists of three core layers: a local retrieval engine, an environment wrapper with grading logic, and a reference inference loop.
server/engine.py is the retrieval backend used by the environment.
- `initialize_db()` creates `data/icd10.db` and seeds it from `icd10_mock.json`.
- The SQLite table stores `code`, `short_desc`, `long_desc`, and `excludes`.
- `search_codes(query, limit=5)` performs lexical retrieval over `short_desc` and `long_desc` using SQLite `LIKE` matching and returns compact candidate summaries.
- `get_code_details(code)` performs exact-code lookup and returns long description plus exclusion notes.
This design is intentionally simple and reproducible. The database is local, deterministic, and fast enough for benchmark-grade evaluation without introducing external search infrastructure.
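
A minimal sketch of this kind of `LIKE`-based lookup, using an in-memory database, may help make the retrieval behavior concrete. It is not the code in `server/engine.py`: the table name `icd10`, the extra `conn` parameter, the ordering, and the three demo rows are assumptions; only the column names and the `search_codes(query, limit=5)` shape come from the description above.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with the four columns described above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE icd10 (code TEXT PRIMARY KEY, short_desc TEXT, "
        "long_desc TEXT, excludes TEXT)"
    )
    # Three demo rows only; the real data is seeded from icd10_mock.json.
    conn.executemany(
        "INSERT INTO icd10 VALUES (?, ?, ?, ?)",
        [
            ("E66.3", "Overweight", "Overweight", ""),
            ("E06.3", "Autoimmune thyroiditis",
             "Autoimmune thyroiditis (Hashimoto's thyroiditis)", ""),
            ("I21.9", "Acute myocardial infarction, unspecified",
             "Acute myocardial infarction, unspecified", ""),
        ],
    )
    return conn


def search_codes(conn: sqlite3.Connection, query: str, limit: int = 5) -> list:
    """Substring match over both description columns, returning compact summaries."""
    pattern = f"%{query}%"  # note: % and _ inside the query are not escaped here
    rows = conn.execute(
        "SELECT code, short_desc FROM icd10 "
        "WHERE short_desc LIKE ? OR long_desc LIKE ? "
        "ORDER BY code LIMIT ?",
        (pattern, pattern, limit),
    ).fetchall()
    return [{"code": code, "short_desc": desc} for code, desc in rows]


conn = build_demo_db()
print(search_codes(conn, "hashimoto"))
```

SQLite's `LIKE` is case-insensitive for ASCII characters by default, so the lowercase query still matches the capitalized description.
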
server/rveda_environment.py wraps the engine in an OpenEnv-compatible task environment.
- On startup, it calls `initialize_db()` so the packaged SQLite database is ready before episodes begin.
- `reset()` loads a task from `tasks.json` and exposes the patient note as the initial observation.
- `step()` routes each action to the proper backend operation and returns structured observations containing search results, detailed code context, reward, and grading metadata.
The environment also records a rich GradingTrace, including:
- difficulty tier,
- search history,
- code inspection history,
- most recent search candidates,
- reward components,
- conflict flags such as `Excludes1` mismatches.
This makes RVEDA RCM ARENA useful not only for final-score benchmarking, but also for trajectory-level analysis of how an agent reasoned through the task.
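
One way to picture such a trace is as a small record that accumulates per-step history. The sketch below is hypothetical: the class name `GradingTraceSketch`, its field names, and its helper methods mirror the list above but are not the actual `GradingTrace` definition in the repo.

```python
from dataclasses import dataclass, field


@dataclass
class GradingTraceSketch:
    """Hypothetical stand-in for the trace fields listed above."""
    difficulty: str
    search_history: list = field(default_factory=list)
    inspected_codes: list = field(default_factory=list)
    last_candidates: list = field(default_factory=list)
    reward_components: dict = field(default_factory=dict)
    conflict_flags: list = field(default_factory=list)

    def record_search(self, query: str, candidates: list) -> None:
        """Append the query and remember only the most recent candidate list."""
        self.search_history.append(query)
        self.last_candidates = list(candidates)

    def record_details(self, code: str) -> None:
        """Note that the agent inspected a specific code."""
        self.inspected_codes.append(code)


trace = GradingTraceSketch(difficulty="medium")
trace.record_search("autoimmune thyroiditis", ["E06.3"])
trace.record_details("E06.3")
print(trace)
```
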
inference.py provides the benchmark submission loop.
At runtime it:
- Reads task IDs from `tasks.json` or from `RVEDA_TASK` / `RVEDA_TASK_IDS`.
- Creates an OpenAI-compatible client using `HF_TOKEN` (or `API_KEY`), `API_BASE_URL`, and `MODEL_NAME`.
- Launches the environment with `RvedaEnv.from_docker_image(IMAGE_NAME)`.
- Resets into an episode and builds a prompt from the current patient note, search results, detailed info, policy state, drift notice, reasoning-log status, and recent action history.
- Asks the model to emit strict JSON with one of seven actions: `SEARCH`, `DETAILS`, `QUERY_EHR`, `CHECK_POLICY`, `VALIDATE_CLAIM_SCHEMA`, `REASONING_LOG`, or `SUBMIT`.
- Executes the action in the environment, logs `[START]`, `[STEP]`, and `[END]` lines, and repeats until termination.
The loop is intentionally benchmark-friendly: it is deterministic in structure, OpenAI-client compliant, and emits normalized episode scores for consistent downstream evaluation.
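
The strict-JSON step in the loop above could be guarded by a small parser like the following sketch. The seven action names come from the text; the `"action"` key, the other payload keys, and the function `parse_action` are assumptions about the wire format, not the contract used by `inference.py`.

```python
import json

ALLOWED_ACTIONS = frozenset({
    "SEARCH", "DETAILS", "QUERY_EHR", "CHECK_POLICY",
    "VALIDATE_CLAIM_SCHEMA", "REASONING_LOG", "SUBMIT",
})


def parse_action(raw: str) -> dict:
    """Parse one model reply into an action dict; raise ValueError on any violation."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("reply must be a JSON object")
    action = payload.get("action")  # assumed key name
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return payload


parsed = parse_action('{"action": "SEARCH", "query": "hashimoto thyroiditis"}')
print(parsed["action"])
```

Rejecting malformed replies up front keeps a bad generation from being silently interpreted as a valid step.
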
RVEDA RCM ARENA is designed around two measurable axes:
- Accuracy: did the agent submit the correct ICD-10 code, or at least the correct code family?
- Efficiency: how economically did the agent search, inspect, and commit within a bounded number of steps?
The environment also exposes rubric-level reward_metrics so terminal correctness, grounding, schema compliance, format validity, process discipline, and drift adaptation can be inspected independently.
Submission quality is the dominant grading signal.
- Exact-code submissions receive the highest base reward.
- Same-family submissions receive partial credit.
- Incorrect-family submissions receive a lower score.
This reflects a realistic coding hierarchy: selecting the right diagnostic family is better than an unrelated code, but full specificity still matters.
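
The three tiers above can be sketched as follows. Two things here are assumptions for illustration: that "family" means the three-character ICD-10 category before the dot, and the numeric values in `ILLUSTRATIVE_BASE_REWARD`, which are placeholders rather than the environment's actual rewards.

```python
def code_family(code: str) -> str:
    """ICD-10 category: the part before the dot, e.g. 'E06' for 'E06.3'."""
    return code.strip().upper().split(".")[0]


def submission_tier(submitted: str, target: str) -> str:
    """Classify a submission as exact, same_family, or wrong_family."""
    if submitted.strip().upper() == target.strip().upper():
        return "exact"
    if code_family(submitted) == code_family(target):
        return "same_family"
    return "wrong_family"


# Placeholder values; only their ordering reflects the text above.
ILLUSTRATIVE_BASE_REWARD = {"exact": 1.0, "same_family": 0.5, "wrong_family": 0.0}

for guess in ("E06.3", "E06.9", "I21.9"):
    tier = submission_tier(guess, "E06.3")
    print(guess, tier, ILLUSTRATIVE_BASE_REWARD[tier])
```
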
RVEDA RCM ARENA also scores process quality before submission.
- Novel, productive searches earn small bonuses.
- Relevant detail lookups earn additional reward.
- Repeated low-value exploration stops improving the score.
- Episodes are capped at 8 steps, and failing to submit within budget ends the episode.
In benchmark terms, this acts as a step penalty: extra actions consume the fixed interaction budget, reduce the value of aimless search, and increase the risk of timing out before a valid SUBMIT. Faster, better-grounded coding trajectories therefore outperform slow or repetitive ones.
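
As a rough illustration of how novelty bonuses and the step cap interact, the sketch below credits each distinct search once and ignores anything past the budget. The 8-step cap is from the text; `NOVEL_SEARCH_BONUS` and `search_bonus` are hypothetical, and the sketch models only novelty, not whether a search was productive.

```python
MAX_STEPS = 8  # episode cap stated above
NOVEL_SEARCH_BONUS = 0.05  # hypothetical value, chosen only for illustration


def search_bonus(queries: list) -> float:
    """Sum a small bonus for each distinct query issued within the step budget."""
    seen = set()
    total = 0.0
    for query in queries[:MAX_STEPS]:
        key = query.strip().lower()
        if key and key not in seen:
            seen.add(key)
            total += NOVEL_SEARCH_BONUS
    return total


# The repeated query earns nothing the second time.
print(search_bonus(["thyroiditis", "Thyroiditis ", "hashimoto"]))
```
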
The final episode score reported by inference.py is normalized to a bounded 0-1 scale so tasks remain comparable across runs.
Episode rewards are standardized before reporting, which keeps evaluation stable while preserving relative ranking between stronger and weaker coding trajectories.
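
The text does not spell out the normalization formula, so the following is only one plausible reading: min-max scaling followed by clipping to the 0-1 range. The function name `normalize_score` and its bounds parameters are assumptions.

```python
def normalize_score(total_reward: float, min_reward: float, max_reward: float) -> float:
    """Min-max scale a raw episode reward, then clip it into [0, 1]."""
    if max_reward <= min_reward:
        raise ValueError("max_reward must be greater than min_reward")
    scaled = (total_reward - min_reward) / (max_reward - min_reward)
    return min(1.0, max(0.0, scaled))


# Illustrative bounds only; the real bounds depend on the reward engine.
print(normalize_score(1.0, 0.0, 2.0))
```

Clipping keeps out-of-range rewards bounded, and the scaling preserves the relative order of trajectories that fall inside the bounds.
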
The benchmark therefore rewards:
- correct final coding decisions,
- efficient evidence gathering,
- auditable trajectories,
- compliance with strict evaluation contracts.
Tasks are defined in tasks.json as JSON objects with four fields:
- `task_id`
- `difficulty`
- `patient_note`
- `target_code`
The current benchmark ships with a simple 3-tier structure:
| Tier | Example Task | Clinical Pattern | Target Code |
|---|---|---|---|
| Easy | `easy_endo_1` | Routine visit with elevated BMI and weight-management counseling | E66.3 |
| Medium | `medium_endo_1` | Autoimmune hypothyroid presentation consistent with Hashimoto's disease | E06.3 |
| Hard | `hard_cardio_1` | Acute myocardial infarction presentation in the emergency setting | I21.9 |
This tiered task structure is useful for benchmarking both capability and scaling behavior: simple lexical retrieval may be enough on easy cases, while harder tasks require better grounding and more disciplined tool use.
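
The four-field task format can be checked with a few lines of Python before a task is added to the suite. This sketch assumes every field is a non-empty string and that `difficulty` is stored in lowercase; the helper `missing_task_fields` and the placeholder note text are illustrative, not part of the repo.

```python
import json

REQUIRED_TASK_FIELDS = ("task_id", "difficulty", "patient_note", "target_code")


def missing_task_fields(task: dict) -> list:
    """Return required fields that are absent or are not non-empty strings."""
    return [
        name for name in REQUIRED_TASK_FIELDS
        if not isinstance(task.get(name), str) or not task[name].strip()
    ]


# Placeholder note text; the real notes live in tasks.json.
sample_task = json.loads(
    '{"task_id": "medium_endo_1", "difficulty": "medium", '
    '"patient_note": "<patient note text>", "target_code": "E06.3"}'
)
incomplete_task = {"task_id": "hard_cardio_1", "difficulty": "hard"}

print(missing_task_fields(sample_task), missing_task_fields(incomplete_task))
```
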
- Python 3.10+
- Docker
- `uv` or a compatible Python package installer

uv sync

The environment initializes the database automatically at startup, but you can also prebuild it explicitly:

python -c "from server.engine import initialize_db; initialize_db()"

This creates `data/icd10.db` from the mock ICD-10 records in `icd10_mock.json`.

docker build -t rveda-env:latest -f Dockerfile .

openenv validate

uvicorn server.app:app --host 0.0.0.0 --port 8000

Set the required environment variables first.

export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_hf_token>"
export IMAGE_NAME="rveda-env:latest"
python inference.py

$env:API_BASE_URL = "https://router.huggingface.co/v1"
$env:MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"
$env:HF_TOKEN = "<your_hf_token>"
$env:IMAGE_NAME = "rveda-env:latest"
python inference.py

Optional task controls:
- `RVEDA_TASK=<task_id>` runs a single task.
- `RVEDA_TASK_IDS=<task_a,task_b,...>` runs a selected task set.
During execution, inference.py prints benchmark-compatible logs in the form:
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...>
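
For downstream tooling, the `[STEP]` layout above can be produced and parsed with a small helper pair. This is a sketch under assumptions: `format_step` and `parse_step` are hypothetical names, and the pattern assumes the action field contains no whitespace, which may not hold for the real logs.

```python
import re
from typing import Optional

STEP_PATTERN = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?\d+\.\d{2}) done=(?P<done>true|false) error=(?P<error>.*)$"
)


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str] = None) -> str:
    """Render one [STEP] line in the layout shown above."""
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={error if error is not None else 'null'}"
    )


def parse_step(line: str) -> dict:
    """Parse a [STEP] line back into typed fields; raise ValueError if it does not match."""
    match = STEP_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    fields = match.groupdict()
    return {
        "step": int(fields["step"]),
        "action": fields["action"],
        "reward": float(fields["reward"]),
        "done": fields["done"] == "true",
        "error": None if fields["error"] == "null" else fields["error"],
    }


line = format_step(2, "DETAILS", 0.1, False)
print(line)
print(parse_step(line))
```
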
- `server/engine.py`: SQLite-backed ICD-10 retrieval and detail lookup
- `server/rveda_environment.py`: environment state machine and reward shaping
- `server/app.py`: FastAPI / OpenEnv server wrapper
- `client.py`: client interface for interacting with the environment
- `models.py`: action, observation, and grading schemas
- `inference.py`: OpenAI-client baseline loop
- `tasks.json`: benchmark task suite
- `icd10_mock.json`: mock ICD-10 source data
- `data/icd10.db`: generated SQLite database used at runtime; created locally and not committed
- `docs/rveda-v2-contract.md`: frozen V2 Task 0.1 schema and verifiability contract
- `examples/v2_task_minimal.json`: minimal V2 curriculum slice example
RVEDA RCM ARENA is a benchmarking environment, not a production clinical coding system.
- The current ICD-10 corpus is mock data.
- Retrieval is lexical and SQLite-backed rather than semantic or ontology-scale.
- The SQLite backend is an intentional benchmark constraint: it keeps the environment local, deterministic, lightweight, and reproducible while forcing agents to reason under limited search conditions.
- The included agent loop is a baseline, not a claim of clinical deployment readiness.
Those constraints are a feature, not a flaw: they keep the benchmark controlled, portable, and easy to reproduce while still exercising the core reasoning loop of agentic medical coding.
RVEDA RCM ARENA is well suited for:
- benchmarking LLM agents on coding accuracy under constrained search,
- comparing single-agent and multi-agent coding strategies,
- studying tool-use efficiency under a fixed step budget,
- auditing reasoning traces in human-in-the-loop evaluation,
- testing stable scoring pipelines for controlled benchmark environments.