fix(docker): add graceful shutdown handler to prevent pg0 data loss on restart#698
fix(docker): add graceful shutdown handler to prevent pg0 data loss on restart#698kagura-agent wants to merge 2 commits intovectorize-io:mainfrom
Conversation
…n restart (vectorize-io#675) - Trap SIGTERM/SIGINT in start-all.sh to forward signals to child processes - pg0 (embedded PostgreSQL) now gets a clean shutdown with WAL flush - 30-second timeout before force-killing unresponsive processes - Add startup data integrity check: warn if pg0 data dir exists but PG_VERSION missing - Improve wait loop robustness: trigger cleanup when any child exits unexpectedly Fixes vectorize-io#675
nicoloboschi
left a comment
There was a problem hiding this comment.
Good PR — solves a real problem cleanly. A few things to consider:
1. Docker stop timeout vs cleanup timeout mismatch
The cleanup waits up to 30s, but Docker's default stop_grace_period is 10s. Docker will SIGKILL the container before cleanup finishes. Either:
- Document that users should set
stop_grace_period: 30sin their compose file (ordocker stop -t 30) - Or reduce the timeout to ~8s to fit within Docker's default
This is the most important callout — without it, the fix may not actually help in the default Docker configuration.
2. Re-entrant cleanup
If a child crashes (triggering cleanup from the wait loop) and Docker simultaneously sends SIGTERM (triggering the trap), cleanup runs twice concurrently. Consider adding a guard:
SHUTTING_DOWN=false
cleanup() {
if $SHUTTING_DOWN; then return; fi
SHUTTING_DOWN=true
...
}3. wait -n && true deserves a comment
The && true is there to prevent set -e from killing the script when wait -n returns non-zero (child exited with error or no children left). Worth a one-line comment since it looks like a no-op otherwise.
4. Minor: PIDS array declared after trap
The trap references ${PIDS[@]} but the array is empty until processes start below. If a signal arrives between trap setup and process launch, cleanup runs on an empty array and exits 0. Harmless, but a comment noting this would help future readers.
5. Minor: find in integrity check
pg0 puts data at a predictable path — a glob like [ -f "$PG0_DATA_DIR"/*/PG_VERSION ] would avoid the subprocess. Not a big deal though.
Overall: solid fix, merge-worthy once the Docker timeout caveat is addressed (either in code or docs).
…r glob - Add SHUTTING_DOWN guard to prevent concurrent cleanup runs - Document Docker stop_grace_period mismatch (30s cleanup vs 10s default) - Replace find subprocess with compgen glob for PG_VERSION check - Add comment explaining wait -n && true idiom
|
Thanks for the thorough review @nicoloboschi! Pushed a commit addressing all points: 1. Docker stop timeout mismatch — Great catch, this is the key usability issue. Added a prominent comment in the cleanup function documenting the mismatch and telling users to set 2. Re-entrant cleanup — Added the 3. 4. PIDS/trap ordering — Added a note. The guard from (2) also makes this safe — if a signal hits before any PIDs exist, cleanup returns immediately since there is nothing to stop. 5. |
Problem
When running Hindsight as a Docker container,
docker restartcan cause the embedded pg0 (PostgreSQL) database to lose all data. Thestart-all.shentrypoint script does not handle SIGTERM — when Docker sends the shutdown signal, child processes (including pg0) are killed abruptly without a clean shutdown. This prevents PostgreSQL from writing a final checkpoint and flushing WAL, which can result in data loss when the volume is remounted after restart.Changes
1. Graceful shutdown handler
trap cleanup SIGTERM SIGINTtostart-all.shcleanupfunction forwards SIGTERM to all tracked child PIDs2. Startup data integrity check
~/.pg0) contains valid PostgreSQL data (looks forPG_VERSIONfile)PG_VERSIONis missing, logs a warning with a link to Bug: Embedded pg0 database loses all data after docker restart (Docker Desktop Extension context) #6753. Improved wait loop
wait -nwith a loop that detects child exits and triggers cleanup for remaining servicesTesting
trapsyntax compatibility with bash 5.x (Docker default)kill -0PID checks work with backgrounded processescleanupfunction is idempotent (safe to call multiple times)Fixes #675