fix(docker): add graceful shutdown handler to prevent pg0 data loss on restart#698

Open
kagura-agent wants to merge 2 commits into vectorize-io:main from kagura-agent:fix/graceful-shutdown-675

Conversation

@kagura-agent (Contributor)

Problem

When running Hindsight as a Docker container, docker restart can cause the embedded pg0 (PostgreSQL) database to lose all data. The start-all.sh entrypoint script does not handle SIGTERM — when Docker sends the shutdown signal, child processes (including pg0) are killed abruptly without a clean shutdown. This prevents PostgreSQL from writing a final checkpoint and flushing WAL, which can result in data loss when the volume is remounted after restart.

Changes

1. Graceful shutdown handler

  • Added trap cleanup SIGTERM SIGINT to start-all.sh
  • The cleanup function forwards SIGTERM to all tracked child PIDs
  • Waits up to 30 seconds for processes to exit cleanly (pg0 needs time to flush WAL)
  • Force-kills any remaining processes after timeout
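
The handler described above can be sketched roughly as follows. This is illustrative, not the PR's exact code: the `PIDS` array name and the service launch points are placeholders, and the 30-second budget follows the description above.

```shell
#!/usr/bin/env bash
# Sketch of a graceful shutdown handler for an entrypoint script.
# PIDS holds the backgrounded service PIDs (pg0, API server, etc.).
set -euo pipefail

PIDS=()

cleanup() {
    # Forward SIGTERM so pg0 can write a final checkpoint and flush WAL.
    kill -TERM "${PIDS[@]}" 2>/dev/null || true

    # Watchdog: force-kill anything still running after 30 seconds.
    ( sleep 30; kill -KILL "${PIDS[@]}" 2>/dev/null ) >/dev/null 2>&1 &
    local watchdog=$!

    # wait reaps the children and returns once they have all exited.
    wait "${PIDS[@]}" 2>/dev/null || true
    kill "$watchdog" 2>/dev/null || true
}

trap cleanup SIGTERM SIGINT
```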

2. Startup data integrity check

  • On startup, warn if the pg0 data directory exists but PG_VERSION is missing (a sign of an incomplete or corrupted data directory)
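
A minimal sketch of such a check (the helper name and data-directory layout here are assumptions, not the PR's exact code):

```shell
#!/usr/bin/env bash
# Sketch of a pg0 startup integrity check. check_pg0_integrity is an
# illustrative helper name; the real script's paths may differ.
set -euo pipefail

check_pg0_integrity() {
    local data_dir="$1"
    # A data directory with no <version>/PG_VERSION file suggests an
    # incomplete or corrupted cluster (e.g. after an unclean shutdown).
    if [ -d "$data_dir" ] && ! compgen -G "$data_dir/*/PG_VERSION" >/dev/null; then
        echo "WARNING: $data_dir exists but contains no PG_VERSION;" \
             "pg0 data may be corrupted" >&2
        return 1
    fi
    return 0
}
```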

3. Improved wait loop

  • Replaced bare wait -n with a loop that detects child exits and triggers cleanup for remaining services
  • Previous behavior: if one service crashed, the other kept running orphaned
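
The improved behavior can be sketched as follows (illustrative only; the service commands and the cleanup body stand in for the real entrypoint's):

```shell
#!/usr/bin/env bash
# Sketch of the improved wait logic: tear everything down as soon as
# ANY child exits, instead of leaving the survivor running orphaned.
set -euo pipefail

PIDS=()
cleanup() {
    kill -TERM "${PIDS[@]}" 2>/dev/null || true
    wait "${PIDS[@]}" 2>/dev/null || true
}

sleep 1 &   PIDS+=($!)   # stand-in for one service
sleep 300 & PIDS+=($!)   # stand-in for the other

# Block until ANY child exits. `&& true` keeps `set -e` from aborting
# when wait -n returns non-zero (a child failed, or none are left).
wait -n && true

# One service exited; stop the remaining services too.
cleanup
echo "all services stopped"
```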

Testing

  • Verified trap syntax compatibility with bash 5.x (Docker default)
  • Confirmed kill -0 PID checks work with backgrounded processes
  • The cleanup function is idempotent (safe to call multiple times)

Fixes #675

…n restart (vectorize-io#675)

- Trap SIGTERM/SIGINT in start-all.sh to forward signals to child processes
- pg0 (embedded PostgreSQL) now gets a clean shutdown with WAL flush
- 30-second timeout before force-killing unresponsive processes
- Add startup data integrity check: warn if pg0 data dir exists but PG_VERSION missing
- Improve wait loop robustness: trigger cleanup when any child exits unexpectedly

Fixes vectorize-io#675
@nicoloboschi (Collaborator) left a comment

Good PR — solves a real problem cleanly. A few things to consider:

1. Docker stop timeout vs cleanup timeout mismatch
The cleanup waits up to 30s, but Docker's default stop_grace_period is 10s. Docker will SIGKILL the container before cleanup finishes. Either:

  • Document that users should set stop_grace_period: 30s in their compose file (or docker stop -t 30)
  • Or reduce the timeout to ~8s to fit within Docker's default

This is the most important callout — without it, the fix may not actually help in the default Docker configuration.
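
For example (container and service names here are placeholders, not from this repo):

```shell
# Give the container the full 30-second shutdown budget:
docker stop -t 30 <container-name>

# docker-compose.yml equivalent:
#
#   services:
#     hindsight:
#       stop_grace_period: 30s
```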

2. Re-entrant cleanup
If a child crashes (triggering cleanup from the wait loop) and Docker simultaneously sends SIGTERM (triggering the trap), cleanup runs twice concurrently. Consider adding a guard:

SHUTTING_DOWN=false
cleanup() {
    if $SHUTTING_DOWN; then return; fi
    SHUTTING_DOWN=true
    ...
}

3. wait -n && true deserves a comment
The && true is there to prevent set -e from killing the script when wait -n returns non-zero (child exited with error or no children left). Worth a one-line comment since it looks like a no-op otherwise.

4. Minor: PIDS array declared after trap
The trap references ${PIDS[@]} but the array is empty until processes start below. If a signal arrives between trap setup and process launch, cleanup runs on an empty array and exits 0. Harmless, but a comment noting this would help future readers.

5. Minor: find in integrity check
pg0 puts data at a predictable path — a glob like [ -f "$PG0_DATA_DIR"/*/PG_VERSION ] would avoid the subprocess. Not a big deal though.

Overall: solid fix, merge-worthy once the Docker timeout caveat is addressed (either in code or docs).

…r glob

- Add SHUTTING_DOWN guard to prevent concurrent cleanup runs
- Document Docker stop_grace_period mismatch (30s cleanup vs 10s default)
- Replace find subprocess with compgen glob for PG_VERSION check
- Add comment explaining wait -n && true idiom
@kagura-agent (Contributor, Author)

Thanks for the thorough review @nicoloboschi! Pushed a commit addressing all points:

1. Docker stop timeout mismatch — Great catch, this is the key usability issue. Added a prominent comment in the cleanup function documenting the mismatch and telling users to set stop_grace_period: 30s or docker stop -t 30. I kept the 30s cleanup timeout rather than reducing to 8s, since pg0 with a large WAL backlog genuinely benefits from the longer window, and the fix should "just work" when the user follows the documented config.

2. Re-entrant cleanup — Added the SHUTTING_DOWN guard, exactly as suggested.

3. wait -n && true comment — Added a comment explaining the idiom.

4. PIDS/trap ordering — Added a note. The guard from (2) also makes this safe — if a signal hits before any PIDs exist, cleanup returns immediately since there is nothing to stop.

5. find → glob — Replaced with compgen -G "$PG0_DATA_DIR"/*/PG_VERSION to avoid the subprocess.

Development

Successfully merging this pull request may close these issues.

Bug: Embedded pg0 database loses all data after docker restart (Docker Desktop Extension context)
