fix(docker): add graceful shutdown handler to prevent pg0 data loss on restart#698

Open
kagura-agent wants to merge 2 commits into vectorize-io:main from kagura-agent:fix/graceful-shutdown-675

Conversation

@kagura-agent (Contributor)

Problem

When running Hindsight as a Docker container, docker restart can cause the embedded pg0 (PostgreSQL) database to lose all data. The start-all.sh entrypoint script does not handle SIGTERM — when Docker sends the shutdown signal, child processes (including pg0) are killed abruptly without a clean shutdown. This prevents PostgreSQL from writing a final checkpoint and flushing WAL, which can result in data loss when the volume is remounted after restart.

Changes

1. Graceful shutdown handler

  • Added trap cleanup SIGTERM SIGINT to start-all.sh
  • The cleanup function forwards SIGTERM to all tracked child PIDs
  • Waits up to 30 seconds for processes to exit cleanly (pg0 needs time to flush WAL)
  • Force-kills any remaining processes after timeout
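
The handler described above can be sketched roughly as follows. This is illustrative, not the PR's exact code: the `PIDS` array name and the service launch points are placeholders, and the 30-second budget follows the description above.

```shell
#!/usr/bin/env bash
# Sketch of a graceful shutdown handler for an entrypoint script.
# PIDS holds the backgrounded service PIDs (pg0, API server, etc.).
set -euo pipefail

PIDS=()

cleanup() {
    # Forward SIGTERM so pg0 can write a final checkpoint and flush WAL.
    kill -TERM "${PIDS[@]}" 2>/dev/null || true

    # Watchdog: force-kill anything still running after 30 seconds.
    ( sleep 30; kill -KILL "${PIDS[@]}" 2>/dev/null ) >/dev/null 2>&1 &
    local watchdog=$!

    # wait reaps the children and returns once they have all exited.
    wait "${PIDS[@]}" 2>/dev/null || true
    kill "$watchdog" 2>/dev/null || true
}

trap cleanup SIGTERM SIGINT
```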

2. Startup data integrity check

  • On startup, warn if the pg0 data directory exists but PG_VERSION is missing (a sign of an incomplete or corrupted data directory)
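
A minimal sketch of such a check (the helper name and data-directory layout here are assumptions, not the PR's exact code):

```shell
#!/usr/bin/env bash
# Sketch of a pg0 startup integrity check. check_pg0_integrity is an
# illustrative helper name; the real script's paths may differ.
set -euo pipefail

check_pg0_integrity() {
    local data_dir="$1"
    # A data directory with no <version>/PG_VERSION file suggests an
    # incomplete or corrupted cluster (e.g. after an unclean shutdown).
    if [ -d "$data_dir" ] && ! compgen -G "$data_dir/*/PG_VERSION" >/dev/null; then
        echo "WARNING: $data_dir exists but contains no PG_VERSION;" \
             "pg0 data may be corrupted" >&2
        return 1
    fi
    return 0
}
```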

3. Improved wait loop

  • Replaced bare wait -n with a loop that detects child exits and triggers cleanup for remaining services
  • Previous behavior: if one service crashed, the other kept running orphaned
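
The improved behavior can be sketched as follows (illustrative only; the service commands and the cleanup body stand in for the real entrypoint's):

```shell
#!/usr/bin/env bash
# Sketch of the improved wait logic: tear everything down as soon as
# ANY child exits, instead of leaving the survivor running orphaned.
set -euo pipefail

PIDS=()
cleanup() {
    kill -TERM "${PIDS[@]}" 2>/dev/null || true
    wait "${PIDS[@]}" 2>/dev/null || true
}

sleep 1 &   PIDS+=($!)   # stand-in for one service
sleep 300 & PIDS+=($!)   # stand-in for the other

# Block until ANY child exits. `&& true` keeps `set -e` from aborting
# when wait -n returns non-zero (a child failed, or none are left).
wait -n && true

# One service exited; stop the remaining services too.
cleanup
echo "all services stopped"
```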

Testing

  • Verified trap syntax compatibility with bash 5.x (Docker default)
  • Confirmed kill -0 PID checks work with backgrounded processes
  • The cleanup function is idempotent (safe to call multiple times)

Fixes #675

…n restart (vectorize-io#675)

- Trap SIGTERM/SIGINT in start-all.sh to forward signals to child processes
- pg0 (embedded PostgreSQL) now gets a clean shutdown with WAL flush
- 30-second timeout before force-killing unresponsive processes
- Add startup data integrity check: warn if pg0 data dir exists but PG_VERSION missing
- Improve wait loop robustness: trigger cleanup when any child exits unexpectedly

Fixes vectorize-io#675
@nicoloboschi (Collaborator) left a comment

Good PR — solves a real problem cleanly. A few things to consider:

1. Docker stop timeout vs cleanup timeout mismatch
The cleanup waits up to 30s, but Docker's default stop_grace_period is 10s. Docker will SIGKILL the container before cleanup finishes. Either:

  • Document that users should set stop_grace_period: 30s in their compose file (or docker stop -t 30)
  • Or reduce the timeout to ~8s to fit within Docker's default

This is the most important callout — without it, the fix may not actually help in the default Docker configuration.
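
For example (container and service names here are placeholders, not from this repo):

```shell
# Give the container the full 30-second shutdown budget:
docker stop -t 30 <container-name>

# docker-compose.yml equivalent:
#
#   services:
#     hindsight:
#       stop_grace_period: 30s
```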

2. Re-entrant cleanup
If a child crashes (triggering cleanup from the wait loop) and Docker simultaneously sends SIGTERM (triggering the trap), cleanup runs twice concurrently. Consider adding a guard:

SHUTTING_DOWN=false
cleanup() {
    if $SHUTTING_DOWN; then return; fi
    SHUTTING_DOWN=true
    ...
}

3. wait -n && true deserves a comment
The && true is there to prevent set -e from killing the script when wait -n returns non-zero (child exited with error or no children left). Worth a one-line comment since it looks like a no-op otherwise.

4. Minor: PIDS array declared after trap
The trap references ${PIDS[@]} but the array is empty until processes start below. If a signal arrives between trap setup and process launch, cleanup runs on an empty array and exits 0. Harmless, but a comment noting this would help future readers.

5. Minor: find in integrity check
pg0 puts data at a predictable path — a glob like [ -f "$PG0_DATA_DIR"/*/PG_VERSION ] would avoid the subprocess. Not a big deal though.

Overall: solid fix, merge-worthy once the Docker timeout caveat is addressed (either in code or docs).

…r glob

- Add SHUTTING_DOWN guard to prevent concurrent cleanup runs
- Document Docker stop_grace_period mismatch (30s cleanup vs 10s default)
- Replace find subprocess with compgen glob for PG_VERSION check
- Add comment explaining wait -n && true idiom
@kagura-agent (Contributor, Author)

Thanks for the thorough review @nicoloboschi! Pushed a commit addressing all points:

1. Docker stop timeout mismatch — Great catch, this is the key usability issue. Added a prominent comment in the cleanup function documenting the mismatch and telling users to set stop_grace_period: 30s or docker stop -t 30. I kept the 30s cleanup timeout rather than reducing to 8s, since pg0 with a large WAL backlog genuinely benefits from the longer window, and the fix should "just work" when the user follows the documented config.

2. Re-entrant cleanup — Added the SHUTTING_DOWN guard, exactly as suggested.

3. wait -n && true comment — Added a comment explaining the idiom.

4. PIDS/trap ordering — Added a note. The guard from (2) also makes this safe — if a signal hits before any PIDs exist, cleanup returns immediately since there is nothing to stop.

5. find → glob — Replaced with compgen -G "$PG0_DATA_DIR"/*/PG_VERSION to avoid the subprocess.

Development

Successfully merging this pull request may close these issues.

Bug: Embedded pg0 database loses all data after docker restart (Docker Desktop Extension context)
