
fix: prevent simulation destruction on page refresh + add dashboard view#30

Open
hassanpasha5630 wants to merge 7 commits into nikmcfly:main from hassanpasha5630:fix/simulation-lifecycle-and-dashboard

Conversation

@hassanpasha5630 hassanpasha5630 commented Apr 3, 2026

Summary

While self-hosting MiroFish-Offline on a 4x GPU Linux server, we discovered that refreshing the browser kills the running simulation and permanently deletes all data. This PR fixes the root cause (two compounding bugs) plus related issues found during debugging, adds a dashboard view to monitor simulations, and fixes orphaned simulation recovery on backend restart.

The kill chain (before this PR)

  1. Step3Simulation.vue onMounted() calls doStartSimulation() unconditionally on every page load
  2. doStartSimulation() has force: true hardcoded, telling the backend to kill any running process
  3. Backend kills the simulation subprocess, deletes all runtime files (action logs, SQLite DBs, run state)
  4. Backend starts a new simulation from round 0
  5. All previous progress is permanently lost

Bug fixes

  • Frontend: check before starting — onMounted now calls getRunStatus() first; if the simulation is already running, it resumes status polling instead of restarting (Step3Simulation.vue)
  • Frontend: don't force by default — changed force: true to force: false in the default start path (Step3Simulation.vue)
  • Backend: pass GraphStorage to subprocess — the /start endpoint was calling SimulationRunner.start_simulation() without the storage parameter, so GraphMemoryUpdater threw "Must provide storage" and graph updates silently failed. Now fetches neo4j_storage from Flask context and passes it through (simulation.py)
  • Backend: preserve state.json on "already running" — when /start returns 400 for an already-running simulation, state.json is now saved with status: "running" before returning the error, preventing a desync where the frontend sees "ready" and retries in a loop (simulation.py)
  • Backend: clear Neo4j on force-restart — cleanup_simulation_logs() now accepts optional storage and graph_id parameters to clear stale graph data during force-restart (simulation_runner.py)
  • Backend: reconnect to orphaned simulations on restart — when the backend restarts (crash, debug auto-reload, manual restart), the monitor thread that updates run_state.json dies while the simulation subprocess keeps running. On startup, SimulationRunner.reconnect_orphaned_simulations() now scans for simulations with runner_status="running", checks if the PID is still alive, and starts a new monitor thread to resume reading action logs and updating state. Dead processes are marked as stopped. (simulation_runner.py, __init__.py)
  • Backend: reconnect GraphMemoryUpdater for orphaned simulations — the orphan reconnect was recovering the monitor thread but not the GraphMemoryUpdater, so graph updates stopped after any backend restart. Now reads graph_id from state.json, creates a fresh Neo4jStorage connection, and restarts the updater so simulation actions continue flowing into the knowledge graph. (simulation_runner.py)
  • Backend: fix duplicate graph episodes on reconnect — the orphan monitor started reading action logs from position 0 on every backend restart, re-feeding all actions to GraphMemoryUpdater and creating duplicate episodes in Neo4j. Now starts from end of existing log files so only new actions are processed. (simulation_runner.py)
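The orphan-reconnect logic hinges on checking whether the PID recorded in run_state.json still refers to a live process. A minimal sketch of that scan (hypothetical helper names; the real SimulationRunner.reconnect_orphaned_simulations() in simulation_runner.py also restarts the monitor thread):

```python
import json
import os
from pathlib import Path

def pid_is_alive(pid: int) -> bool:
    """Signal 0 checks existence/permissions without actually signalling."""
    try:
        os.kill(pid, 0)
    except OSError:
        return False
    return True

def scan_orphaned_runs(runs_dir: str) -> dict:
    """Classify persisted runs: still-alive (reconnect) vs dead (mark stopped)."""
    alive, dead = [], []
    for state_file in Path(runs_dir).glob("*/run_state.json"):
        state = json.loads(state_file.read_text())
        if state.get("runner_status") != "running":
            continue  # completed/stopped runs need no recovery
        if pid_is_alive(state.get("pid", -1)):
            alive.append(state_file.parent.name)   # resume monitoring
        else:
            dead.append(state_file.parent.name)    # process died with the backend
    return {"reconnect": alive, "mark_stopped": dead}
```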

Additional fixes (found during debugging)

  • Bumped default LLM context window from 8192 to 32768 tokens (llm_client.py)
  • Added traceback logging for ontology generation failures (graph.py)
  • Filter malformed entity/edge types missing the name key before validation (ontology_generator.py)
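The last fix above amounts to a defensive filter; a sketch of the idea (assuming the generator yields plain dicts, which is an assumption about ontology_generator.py's internals):

```python
def filter_typed(items: list[dict]) -> list[dict]:
    """Drop entity/edge type definitions without a usable 'name' key,
    so downstream validation never sees a malformed entry."""
    return [t for t in items if isinstance(t, dict) and t.get("name")]
```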

New feature: Simulation Dashboard (/dashboard)

Built to verify that the bug fixes were working — turned out to be a useful feature:

  • Active Now section with live-updating cards for running simulations (progress bars, round counts, stop/view actions, 3s polling)
  • All Simulations history table with search input and status filter tabs (All / Running / Completed / Stopped)
  • Navigation to Graph, Simulation Run, and Report views per simulation
  • Matches existing design language (Space Grotesk, JetBrains Mono, custom CSS, no new dependencies)
  • Link added to Home navbar

Test plan

  • Start a simulation, refresh the browser — simulation continues running (not killed)
  • Dashboard shows running simulation with live progress updates
  • GraphMemoryUpdater writes to Neo4j during simulation (verified: 51 entities, 20 relations, 25 episodes)
  • Calling /start with force: false on a running simulation returns 400 without killing it
  • Restart backend while simulation is running — backend reconnects and resumes monitoring automatically
  • Restart backend — GraphMemoryUpdater reconnects and graph updates resume
  • Restart backend multiple times — no duplicate graph episodes created
  • Dashboard filter tabs and search work correctly
  • Navigation from dashboard cards to simulation/report views works

Browser refresh was killing running simulations due to two compounding bugs:
the frontend unconditionally called /start on mount with force=true hardcoded,
nuking the running process and deleting all data files every time.

Bug fixes:
- Frontend: check run-status before starting; only start if not already running
- Frontend: change force flag from true to false in default start path
- Backend: pass GraphStorage to SimulationRunner.start_simulation() so graph
  memory updates actually work (was failing silently with "Must provide storage")
- Backend: preserve state.json as "running" when /start returns 400 for
  already-running sim (prevents frontend retry loop from state desync)
- Backend: clear Neo4j graph data during force-restart cleanup (was leaving
  stale nodes/edges from previous runs)
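The state-desync guard can be reduced to a pure decision (hypothetical helper; the real /start handler in simulation.py additionally persists state.json and manages the subprocess):

```python
def decide_start(is_running: bool, force: bool) -> str:
    """Return what /start should do for the current process state.
    'reject_and_persist_running' means: return 400, but first rewrite
    state.json with status "running" so a polling frontend doesn't see
    a stale "ready" and retry in a loop."""
    if is_running and not force:
        return "reject_and_persist_running"
    if is_running and force:
        return "kill_then_start"
    return "start"
```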

Additional fixes applied during debugging:
- Bump default LLM context window from 8192 to 32768 tokens
- Add traceback logging for ontology generation failures
- Filter malformed entity/edge types missing 'name' key

New feature — simulation dashboard (/dashboard):
- "Active Now" section with live-updating cards for running simulations
  (progress bars, round counts, stop/view actions, 3s polling)
- "All Simulations" history table with search and status filter tabs
- Added to help verify the bug fixes were working correctly

Made-with: Cursor
When the backend restarts (crash, manual restart, debug auto-reload),
the monitor thread that reads action logs and updates run_state.json
dies, but the simulation subprocess survives. This left simulations
in a "running" state with stale progress data.

On startup, SimulationRunner now scans for run_state.json files with
runner_status="running", checks if the PID is still alive, and starts
a lightweight monitor thread that reads action logs and updates state.
Dead processes are marked as stopped.

Made-with: Cursor
The orphan reconnect was recovering the monitor thread (for run_state
updates) but not the GraphMemoryUpdater, so graph updates stopped after
any backend restart. Now reads graph_id from state.json, creates a fresh
Neo4jStorage connection, and restarts the updater so simulation actions
continue flowing into the knowledge graph.

Made-with: Cursor
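The updater reconnect described in the commit above can be sketched as follows; storage_factory and updater_factory stand in for the real Neo4jStorage and GraphMemoryUpdater constructors, whose exact signatures are assumptions here:

```python
import json
from pathlib import Path

def reconnect_graph_updater(run_dir, storage_factory, updater_factory):
    """Rebuild the graph pipeline for an orphaned run (sketch).
    Reads graph_id from the run's state.json, opens a fresh storage
    connection, and hands both to a new updater instance."""
    state = json.loads((Path(run_dir) / "state.json").read_text())
    graph_id = state.get("graph_id")
    if not graph_id:
        return None  # run had no graph memory attached; nothing to restore
    storage = storage_factory()          # fresh connection after restart
    return updater_factory(storage, graph_id)
```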
- vite.config.js: host 0.0.0.0, port 3001, ngrok allowedHosts
- api/index.js: empty baseURL for proxy-based deployment
- future_features.md: proprietary feature roadmap (event injection,
  simulation resume, narrative stacking, comparative runs)

Made-with: Cursor
onMounted only checked for runner_status='running'. When a completed
simulation was visited, it tried to restart it (got 400 rejected) and
showed a blank "WAITING FOR AGENT ACTIONS" state. Now handles completed
and stopped states by loading results directly (phase 2).

Made-with: Cursor
When visiting a completed simulation, the action feed showed "WAITING
FOR AGENT ACTIONS" because fetchRunStatusDetail() was never called.
Now fetches detail data once on load so the action timeline populates.

Made-with: Cursor
_monitor_orphaned_simulation started reading action logs from position 0
on every backend restart, re-feeding all actions to GraphMemoryUpdater
and creating duplicate episodes. Now starts from end of existing log
files so only new actions are processed.

Made-with: Cursor
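The seek-to-end fix above boils down to remembering a file offset instead of always reading from position 0. A minimal sketch (not the real _monitor_orphaned_simulation, which also parses actions and feeds GraphMemoryUpdater):

```python
import os

class LogFollower:
    """Follow an action log incrementally. With from_end=True, a
    reconnecting monitor skips everything already in the file, so old
    actions are never re-fed into the graph as duplicate episodes."""

    def __init__(self, path: str, from_end: bool = True):
        self.path = path
        self.pos = os.path.getsize(path) if (from_end and os.path.exists(path)) else 0

    def read_new_lines(self) -> list[str]:
        """Return lines appended since the last call, advancing the offset."""
        with open(self.path, "r") as f:
            f.seek(self.pos)
            lines = f.readlines()
            self.pos = f.tell()
        return [ln.rstrip("\n") for ln in lines]
```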