| title | OASIS | |
|---|---|---|
| emoji | π | |
| colorFrom | blue | |
| colorTo | green | |
| sdk | docker | |
| pinned | false | |
| app_port | 8000 | |
| tags | | |
An OpenEnv reinforcement learning environment for training AI agents to manage insulin dosing in Type 1 Diabetes
4 tasks · 19-field observation space · gamma-CDF pharmacokinetics · live interactive dashboard
Live Demo · API Docs · Quick Start
Type 1 Diabetes affects over 9 million people worldwide. These patients produce zero insulin and must deliver it externally, with a dosing decision every few minutes that is unforgiving in both directions:
| Condition | Glucose Level | Consequence |
|---|---|---|
| Mild Hyperglycemia | > 180 mg/dL | Progressive organ damage over months |
| Severe Hyperglycemia | > 250 mg/dL | Diabetic ketoacidosis (emergency) |
| Mild Hypoglycemia | < 70 mg/dL | Confusion, tremors, impaired function |
| Severe Hypoglycemia | < 54 mg/dL | Seizures, loss of consciousness, death within minutes |
The clinical gold standard is Time-in-Range (TIR): the percentage of time glucose stays within 70–180 mg/dL. Guidelines recommend ≥70%. Most patients achieve far less.
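
As a rough illustration of the TIR definition above, the sketch below computes the fraction of readings inside the 70–180 mg/dL band. The function name `time_in_range` and the choice of inclusive bounds are assumptions made for this example, not taken from the OASIS code.

```python
def time_in_range(readings_mg_dl, low=70.0, high=180.0):
    """Return the fraction of readings within [low, high] mg/dL.

    Inclusive bounds are an assumption of this sketch.
    """
    if not readings_mg_dl:
        raise ValueError("need at least one glucose reading")
    in_range = sum(1 for g in readings_mg_dl if low <= g <= high)
    return in_range / len(readings_mg_dl)


# Five equally spaced readings: 65 and 190 fall outside the band.
readings = [65.0, 110.0, 150.0, 190.0, 175.0]
print(f"TIR: {time_in_range(readings):.0%}")
```

Because steps are equally spaced in time, the fraction of in-range readings equals the fraction of in-range time.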
Current commercial insulin pumps use PID controllers: rule-based systems tuned for an "average" patient. But no patient is average. A child may be 5× more insulin-sensitive than an adult. Exercise changes sensitivity unpredictably. Illness causes resistance. These controllers fail silently, and patients pay the price.
OASIS exists to train RL agents that adapt where PID controllers cannot.
OASIS is not a toy environment. Every design decision is grounded in clinical physiology:
| Feature | Implementation | Clinical Basis |
|---|---|---|
| FDA-accepted simulator | UVa/Padova T1D model via simglucose | Gold standard for in-silico T1D research |
| 30 virtual patients | Adolescents, adults, children with distinct physiology | Real inter-patient variability |
| CGM noise | σ=10 mg/dL Gaussian on subcutaneous glucose (Gsub) | ISO 15197 accuracy standard |
| Gamma-CDF pharmacokinetics | IOB modelled with gamma distribution (peak 55 min, clear 8 hrs) | Rapid-acting insulin absorption profile (Lispro/Aspart) |
| Exercise physiology | 20–70% insulin sensitivity increase during activity | Skeletal muscle glucose transport |
| Illness simulation | 1.5–2.5× insulin resistance at unknown onset | Inflammatory insulin receptor downregulation |
| Asymmetric reward | Hypo penalised 2–6× heavier than hyper | Acute vs. cumulative clinical risk |
| Recovery bonus | +0.5 for hypo correction, +0.3 for hyper within 10 steps | Incentivises active clinical management |
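
The CGM noise row above can be pictured with a short sketch: zero-mean Gaussian noise with σ = 10 mg/dL added to the simulator's true glucose. The helper name `noisy_cgm` and the use of Python's `random` module are assumptions for illustration; the actual implementation lives in `patient_manager.py` and may differ.

```python
import random

CGM_NOISE_SD_MG_DL = 10.0  # sigma quoted in the feature table


def noisy_cgm(true_glucose_mg_dl, rng):
    """Return a simulated CGM reading: true glucose plus Gaussian noise.

    Hypothetical helper, not the OASIS implementation.
    """
    return true_glucose_mg_dl + rng.gauss(0.0, CGM_NOISE_SD_MG_DL)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
samples = [noisy_cgm(120.0, rng) for _ in range(5)]
print([round(s, 1) for s in samples])
```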
At each 3-minute step, the agent observes a 19-field clinical state and outputs a 2D continuous action:

    ┌───────────────────────────────────────────┐
    │ OBSERVATION (19 fields)                   │
    │                                           │
    │ CGM glucose (noisy) ───────── 142.3 mg/dL │
    │ Glucose trend ──────────────────── rising │
    │ 12-reading history window ──── [138, ...] │
    │ Meal announced? ───────────────────── Yes │
    │ Meal carbs ────────────────────────── 70g │
    │ Exercise intensity ────────────────── 0.0 │
    │ Insulin-on-board (gamma-CDF) ────── 2.4 U │
    │ Time of day ────────────────────── 10.0 h │
    │ ... and 11 more fields                    │
    └─────────────────────┬─────────────────────┘
                          │
                     ┌────▼────┐
                     │  AGENT  │
                     └────┬────┘
                          │
    ┌─────────────────────▼─────────────────────┐
    │ ACTION (2 fields)                         │
    │                                           │
    │ Basal rate ─────────── 1.2 U/hr (0.0–5.0) │
    │ Bolus dose ───────────── 5.0 U (0.0–20.0) │
    └─────────────────────┬─────────────────────┘
                          │
    ┌─────────────────────▼─────────────────────┐
    │ REWARD (6 components)                     │
    │                                           │
    │ In-range bonus ───────────────────── +1.0 │
    │ Hypo penalty ─────────────── -1.0 to -3.0 │
    │ Hyper penalty ────────────── -0.5 to -1.5 │
    │ Overdose penalty ─────────────────── -3.0 │
    │ Recovery bonus ───────────── +0.3 to +0.5 │
    │ Step total ───────────────── sum of above │
    └───────────────────────────────────────────┘

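
The action bounds shown above (basal 0.0–5.0 U/hr, bolus 0.0–20.0 U) suggest that a policy's raw outputs should be kept inside those ranges before being sent to the environment. The sketch below shows one way to do that on the agent side. The `Action` dataclass and `clamp_action` helper are hypothetical names for this example, and whether the server itself clips or rejects out-of-range values is not stated here.

```python
from dataclasses import dataclass

BASAL_RANGE_U_PER_HR = (0.0, 5.0)   # bounds from the action diagram
BOLUS_RANGE_U = (0.0, 20.0)


@dataclass(frozen=True)
class Action:
    """Hypothetical stand-in for the environment's 2D action."""
    basal_rate: float
    bolus_dose: float


def _clamp(value, bounds):
    low, high = bounds
    return max(low, min(high, value))


def clamp_action(basal_rate, bolus_dose):
    """Clip raw policy outputs into the documented action ranges."""
    return Action(
        basal_rate=_clamp(basal_rate, BASAL_RANGE_U_PER_HR),
        bolus_dose=_clamp(bolus_dose, BOLUS_RANGE_U),
    )


# An out-of-range policy output is pulled back to the nearest bound.
print(clamp_action(7.5, -1.0))
```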
| Task | Name | Difficulty | Patient | Meals | Exercise | Illness |
|---|---|---|---|---|---|---|
| 1 | Basal Rate Control | 🟢 Easy | adult#001 | None | None | None |
| 2 | Meal Bolus Timing | 🟡 Medium | adult#001 | 3 announced | Announced | None |
| 3 | Cross-Patient Generalisation | 🔴 Hard | Random/30 | 3 unannounced | Random | None |
| 4 | Sick Day Management | ⚫ Expert | Random/30 | 3 unannounced | Random | 1.5–2.5× resistance |
Task 1 establishes baseline control: keep glucose stable with basal insulin only.
Task 2 introduces meal management. Three daily meals (50 g, 70 g, and 80 g of carbohydrate) are announced 30 minutes ahead, so the agent must learn pre-meal bolus timing. A moderate exercise event at step 150 (also announced) tests exercise-aware dosing.
Task 3 tests generalisation. A patient is drawn at random from the 30 profiles: children, who can be 5× more insulin-sensitive, adolescents, and adults. Meals and exercise are unannounced and patient identity is hidden, so the agent must infer physiology from glucose dynamics alone.
Task 4 is the hardest. A random patient develops an illness causing 1.5–2.5× insulin resistance at an unknown time. The agent is never told. It must detect rising glucose despite normal dosing, infer that insulin has become less effective, and increase delivery without over-correcting. Fixed-gain PID controllers perform poorly on this task, so an RL agent that succeeds here would represent a clinically meaningful advance.
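
The task descriptions mix step counts and clock durations, so a small conversion sketch may help. It uses only the 3-minute step length stated earlier. The helper names are hypothetical, and mapping step 0 to 00:00 is an assumption of this example rather than a documented property of the environment.

```python
STEP_MINUTES = 3  # one environment step, as stated above


def minutes_to_steps(minutes):
    """Convert a duration in minutes to a whole number of steps."""
    if minutes % STEP_MINUTES:
        raise ValueError("duration is not a multiple of the step length")
    return minutes // STEP_MINUTES


def step_to_clock(step, start_minutes=0):
    """Format a step index as HH:MM, assuming step 0 is at start_minutes."""
    total = (start_minutes + step * STEP_MINUTES) % (24 * 60)
    hours, minutes = divmod(total, 60)
    return f"{hours:02d}:{minutes:02d}"


print(minutes_to_steps(30))       # length of the 30-minute announcement window
print(minutes_to_steps(24 * 60))  # number of steps in one simulated day
print(step_to_clock(150))         # clock time of step 150 if step 0 is midnight
```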
All results are deterministic (seed=42) and reproducible via `python eval.py`:
| Agent | Task 1 | Task 2 | Task 3 | Task 4 |
|---|---|---|---|---|
| Constant Basal (no intelligence) | 1.000 | 0.000 | 0.345 | ~0.050 |
| PID Controller (clinical standard) | 1.000 | 0.736 | 0.206 | ~0.120 |
| Target: Good RL Agent | ≥ 0.95 | ≥ 0.70 | ≥ 0.60 | ≥ 0.45 |
Key insight from Task 3: The PID controller scores worse than constant basal (0.206 vs 0.345) because its adult-tuned aggressive corrections cause fatal hypoglycemia in 4 of 5 child/adolescent patients. A "smarter" fixed controller is more dangerous than a conservative one when patient physiology varies. This is exactly why adaptive RL agents are needed.
| Field | Type | Description |
|---|---|---|
| `glucose_mg_dl` | float | CGM reading with ISO 15197 noise (σ=10 mg/dL) |
| `glucose_trend` | string | rapidly_falling / falling / stable / rising / rapidly_rising |
| `glucose_history_window` | list[float] | Last 12 CGM readings (36 min context) |
| `meal_announced` | bool | Meal within 30 min (Task 2 only) |
| `meal_grams_announced` | float | Carbs in announced meal |
| `exercise_intensity` | float | Current exercise (0=rest, 1=max) |
| `exercise_announced` | bool | Exercise within 30 min (Task 2 only) |
| `insulin_on_board_units` | float | Active insulin via gamma-CDF PK model |
| `time_of_day_hours` | float | Simulated time (0.0–24.0) |
| `step` | int | Current step (0–479) |
| `patient_id` | string/null | Hidden in Task 3/4 |
| `last_action_basal` | float | Previous basal rate |
| `last_action_bolus` | float | Previous bolus dose |
| `true_glucose_mg_dl` | float/null | Pre-noise glucose (research/debug) |
| `illness_active` | bool | Debug only; always False in normal mode |
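
The `glucose_history_window` field holds the last 12 readings, which at 3 minutes per step covers 36 minutes. A fixed-length window like this is commonly kept with a bounded deque, as in the sketch below. This illustrates the data structure only; how OASIS builds the window internally is an assumption here.

```python
from collections import deque

STEP_MINUTES = 3
WINDOW_SIZE = 12  # readings kept in glucose_history_window

# A deque with maxlen drops the oldest reading automatically.
window = deque(maxlen=WINDOW_SIZE)

# Feed 20 synthetic readings (100, 101, ..., 119 mg/dL).
for step in range(20):
    window.append(100.0 + step)

print(list(window))                        # only the 12 most recent remain
print(len(window) * STEP_MINUTES, "min")   # time span covered by the window
```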
| Glucose Zone | Component | Value | Rationale |
|---|---|---|---|
| 70–180 mg/dL | TIR contribution | +1.0 | Target range |
| 54–70 mg/dL | Hypo penalty | −1.0 | Dangerous |
| < 54 mg/dL | Severe hypo | −3.0 | Life-threatening |
| 180–250 mg/dL | Hyper penalty | −0.5 | Long-term damage |
| > 250 mg/dL | Severe hyper | −1.5 | Acute risk |
| < 54 + recent bolus | Overdose | −3.0 | Prevents reward hacking |
| Hypo corrected ≤10 steps | Recovery bonus | +0.5 | Rewards active correction |
| Hyper corrected ≤10 steps | Recovery bonus | +0.3 | Rewards proactive management |
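
The zone-based rows of the reward table can be sketched as a small function. This is a simplified illustration, not the code in `reward_calculator.py`. It assumes that the overdose penalty is added on top of the severe-hypo penalty, that 70 and 180 count as in range, and that the caller decides what "recent bolus" means. The recovery bonuses are left out because they depend on episode history.

```python
def zone_reward(glucose_mg_dl, recent_bolus=False):
    """Per-step zone reward from the table (recovery bonuses omitted).

    Boundary handling and the additive overdose term are assumptions.
    """
    if glucose_mg_dl < 54.0:
        reward = -3.0                 # severe hypoglycemia
        if recent_bolus:
            reward += -3.0            # overdose penalty stacked on top
        return reward
    if glucose_mg_dl < 70.0:
        return -1.0                   # mild hypoglycemia
    if glucose_mg_dl <= 180.0:
        return 1.0                    # in target range
    if glucose_mg_dl <= 250.0:
        return -0.5                   # mild hyperglycemia
    return -1.5                       # severe hyperglycemia


for g in (50.0, 60.0, 120.0, 200.0, 300.0):
    print(g, zone_reward(g))
```

Under these assumptions, hypoglycemia at the same severity level costs two to six times as much as hyperglycemia, which matches the asymmetry described earlier.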
Unlike simple exponential decay models, OASIS uses a gamma-distribution cumulative absorption curve matching the pharmacokinetic profile of rapid-acting insulin (Lispro/Aspart/Fiasp):

$$\mathrm{IOB}(t) = \sum_i \mathrm{dose}_i \times \bigl(1 - F_\gamma(t - t_i)\bigr)$$

where $\mathrm{dose}_i$ is the $i$-th insulin delivery, $t_i$ is its injection time, and $F_\gamma$ is the gamma CDF with shape $k=2$ and peak at 55 minutes. The model tracks 160 steps (8 hours) of insulin delivery history, both basal and bolus, and computes the fraction not yet absorbed at each time offset. This produces realistic IOB curves: a 10 U bolus shows 10.0 U immediately, 6.3 U at 90 minutes, and 2.4 U at 4 hours.
Commercial artificial pancreas systems (Medtronic 780G, Tandem Control-IQ, Omnipod 5) display IOB as a primary safety signal to prevent bolus stacking. OASIS gives RL agents the same information.
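
The IOB sum above can be sketched in a few lines of Python. For shape k=2 the gamma CDF has a closed form, so no external library is needed. This sketch assumes that "peak at 55 minutes" refers to the mode of the gamma density, which for k=2 makes the scale parameter equal to 55 minutes. The exact parameterisation used by OASIS is not given here, so the values this sketch produces need not match the figures quoted above. The function names are hypothetical.

```python
import math

PEAK_MINUTES = 55.0  # assumed to be the mode of the gamma density (k=2)


def fraction_remaining(minutes_since_dose, scale=PEAK_MINUTES):
    """Return 1 - F_gamma(t) for shape k=2, i.e. exp(-x) * (1 + x), x = t / scale."""
    if minutes_since_dose <= 0:
        return 1.0
    x = minutes_since_dose / scale
    return math.exp(-x) * (1.0 + x)


def insulin_on_board(doses, now_minutes):
    """Sum the unabsorbed part of every dose delivered up to now.

    doses is an iterable of (delivery_time_minutes, units) pairs.
    """
    return sum(
        units * fraction_remaining(now_minutes - t)
        for t, units in doses
        if t <= now_minutes
    )


# A single 10 U bolus at t=0, sampled at a few offsets.
bolus = [(0.0, 10.0)]
for minutes in (0, 55, 240, 480):
    print(minutes, round(insulin_on_board(bolus, minutes), 2))
```

Summing over all past deliveries is what lets an agent see stacked boluses as one IOB value.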

Run locally:

    git clone https://github.com/saksham1771/glucorl.git
    cd glucorl
    pip install -r requirements.txt
    uvicorn server.app:app --port 8000
    # Open the interactive dashboard
    open http://localhost:8000

Run with Docker:

    docker build -t oasis .
    docker run -p 8000:8000 oasis

Use the Python client:

    from client import GlucoEnv
    from models import GlucoAction

    with GlucoEnv(base_url="http://localhost:8000") as env:
        result = env.reset(task_id=2)
        while not result.done:
            obs = result.observation
            action = GlucoAction(
                basal_rate=1.2,
                bolus_dose=5.0 if obs.meal_announced else 0.0
            )
            result = env.step(action)
        state = env.state()
        print(f"TIR: {state.tir_current:.1%}")

Run the baseline inference script:

    export GLUCORL_ENV_URL="http://localhost:8000"
    export API_BASE_URL="https://router.huggingface.co/v1"
    export HF_TOKEN="hf_your_token"
    export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
    python inference.py

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Interactive web dashboard (WebSocket-based, real-time) |
| POST | `/reset` | Start episode. Body: `{"task_id": 1}` (1–4) |
| POST | `/step` | Take action. Body: `{"basal_rate": 1.0, "bolus_dose": 0.0}` |
| GET | `/state` | Full episode state with glucose history and metrics |
| GET | `/tasks` | List all 4 tasks with descriptions |
| POST | `/grade` | Detailed decomposed score breakdown |
| GET | `/health` | Health check |
| WS | `/ws` | WebSocket for persistent sessions |
| GET | `/docs` | Swagger API documentation |
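
For callers that prefer plain HTTP over the bundled client, the sketch below builds the two POST requests listed in the endpoint table using only the Python standard library. The request bodies come from the table. The helper `build_post` is a hypothetical name, and the response schema is not shown because it is not documented here. Sending the requests needs a running server, so that part is left commented out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local server from the quick start


def build_post(path, payload):
    """Build a JSON POST request for an endpoint (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reset_req = build_post("/reset", {"task_id": 1})
step_req = build_post("/step", {"basal_rate": 1.0, "bolus_dose": 0.0})

print(reset_req.get_method(), reset_req.full_url)
print(step_req.get_method(), step_req.full_url)

# To send a request against a running server:
# with urllib.request.urlopen(reset_req) as resp:
#     print(json.load(resp))
```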
OASIS is designed for GRPO training via TRL:
- Dense reward: every step produces signal (+1.0 to −6.0 range)
- Continuous action space: 2D (basal + bolus), amenable to policy gradient methods
- 480-step episodes: long enough for meaningful trajectories, short enough for fast iteration
- 4-task curriculum: natural difficulty progression for progressive training
- Reward variance: successful episodes score +300 to +480 and failed episodes −200 to −500; GRPO needs this spread
The glucose_history_window (12 readings) enables feedforward agents to reason temporally without RNN architectures. Full history is available via /state for agents that prefer complete episode context.
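
To make the reward-variance point concrete: GRPO scores each rollout relative to the other rollouts in its group, so a group whose returns are all equal yields no learning signal. The sketch below shows that normalisation on made-up episode returns. It is a simplified illustration using the population standard deviation, not TRL's implementation, and the return values are invented for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """Normalise returns within one group: (r - mean) / (std + eps)."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Made-up returns for four rollouts of the same task.
episode_returns = [420.0, 310.0, -250.0, -480.0]
print([round(a, 2) for a in group_relative_advantages(episode_returns)])

# With identical returns every advantage is zero, so there is nothing to learn from.
print(group_relative_advantages([300.0, 300.0, 300.0, 300.0]))
```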

    oasis/
    ├── inference.py                 # Baseline inference (OpenAI client)
    ├── models.py                    # Pydantic: Action, Observation, State, Reward
    ├── client.py                    # WebSocket client (EnvClient)
    ├── eval.py                      # PID vs baseline benchmark
    ├── openenv.yaml                 # OpenEnv spec (4 tasks)
    ├── Dockerfile                   # HF Spaces (openenv-base)
    ├── server/
    │   ├── app.py                   # FastAPI + interactive dashboard + /grade
    │   ├── glucorl_environment.py   # Core: reset/step/state with all 8 enhancements
    │   ├── patient_manager.py       # simglucose wrapper + CGM noise + exercise
    │   ├── reward_calculator.py     # Shaped reward + recovery bonus
    │   ├── graders.py               # 4 task graders + grade_detailed()
    │   ├── pid_controller.py        # PID baseline with anti-windup
    │   └── constants.py             # Thresholds, PK/PD, meals, exercise, illness
    └── tests/                       # 120 tests (environment, graders, reward)

- simglucose: FDA-accepted UVa/Padova T1D simulator by Jinyu Xie
- OpenEnv: open environment specification by Meta PyTorch
- UVa/Padova Model: Kovatchev et al., Journal of Diabetes Science and Technology, 2009
- Insulin PK/PD: gamma-CDF absorption model based on Hovorka et al., 2004