Skip to content

Commit 6da0e77

Browse files
committed
fix(bootstrap,server): persist sandbox state across gateway stop/start cycles
Two changes to preserve sandbox state across gateway restarts: 1. Deterministic k3s node identity: Set the Docker container hostname to a deterministic name derived from the gateway name (openshell-{name}). Pass OPENSHELL_NODE_NAME env var and --node-name flag to k3s via the cluster entrypoint as belt-and-suspenders. Update clean_stale_nodes() to prefer the deterministic name with a fallback to the container hostname for backward compatibility with older cluster images. This prevents clean_stale_nodes() from deleting PVCs (including the server's SQLite database) when the container is recreated after an image upgrade. 2. Default workspace persistence: Inject a 2Gi PVC and init container into every sandbox pod so the /sandbox directory survives pod rescheduling. The init container uses the same sandbox image, mounts the PVC at a temporary path, and copies the image's /sandbox contents (Python venv, dotfiles, skills) into the PVC on first use — guarded by a sentinel file so subsequent restarts are instant. The agent container then mounts the populated PVC at /sandbox. Users who supply custom volumeClaimTemplates are unaffected — the default workspace is skipped. Fixes #738
1 parent dd8dd8a commit 6da0e77

File tree

7 files changed

+470
-17
lines changed

7 files changed

+470
-17
lines changed

architecture/gateway-single-node.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -185,7 +185,7 @@ For the target daemon (local or remote):
185185

186186
After the container starts:
187187

188-
1. **Clean stale nodes**: `clean_stale_nodes()` finds `NotReady` nodes via `kubectl get nodes` and deletes them. This is needed when a container is recreated but reuses the persistent volume -- k3s registers a new node (using the container ID as hostname) while old node entries persist in etcd. Non-fatal on error; returns the count of removed nodes.
188+
1. **Clean stale nodes**: `clean_stale_nodes()` finds nodes whose name does not match the deterministic k3s `--node-name` and deletes them. That node name is derived from the gateway name but normalized to a Kubernetes-safe lowercase form so existing gateway names that contain `_`, `.`, or uppercase characters still produce a valid node identity. This cleanup is needed when a container is recreated but reuses the persistent volume -- old node entries can persist in etcd. Non-fatal on error; returns the count of removed nodes.
189189
2. **Push local images** (optional, local deploy only): If `OPENSHELL_PUSH_IMAGES` is set, the comma-separated image refs are exported from the local Docker daemon as a single tar, uploaded into the container via `docker put_archive`, and imported into containerd via `ctr images import` in the `k8s.io` namespace. After import, `kubectl rollout restart deployment/openshell openshell` is run, followed by `kubectl rollout status --timeout=180s` to wait for completion. See `crates/openshell-bootstrap/src/push.rs`.
190190
3. **Wait for gateway health**: `wait_for_gateway_ready()` polls the Docker HEALTHCHECK status up to 180 times, 2 seconds apart (6 min total). A background task streams container logs during this wait. Failure modes:
191191
- Container exits during polling: error includes recent log lines.

architecture/gateway.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -501,7 +501,7 @@ The Helm chart template is at `deploy/helm/openshell/templates/statefulset.yaml`
501501

502502
`SandboxClient` (`crates/openshell-server/src/sandbox/mod.rs`) manages `agents.x-k8s.io/v1alpha1/Sandbox` CRDs.
503503

504-
- **Create**: Translates a `Sandbox` proto into a Kubernetes `DynamicObject` with labels (`openshell.ai/sandbox-id`, `openshell.ai/managed-by: openshell`) and a spec that includes the pod template, environment variables, and gateway-required env vars (`OPENSHELL_SANDBOX_ID`, `OPENSHELL_ENDPOINT`, `OPENSHELL_SSH_LISTEN_ADDR`, etc.).
504+
- **Create**: Translates a `Sandbox` proto into a Kubernetes `DynamicObject` with labels (`openshell.ai/sandbox-id`, `openshell.ai/managed-by: openshell`) and a spec that includes the pod template, environment variables, and gateway-required env vars (`OPENSHELL_SANDBOX_ID`, `OPENSHELL_ENDPOINT`, `OPENSHELL_SSH_LISTEN_ADDR`, etc.). When callers do not provide custom `volumeClaimTemplates`, the server injects a default `workspace` PVC and mounts it at `/sandbox` so the default sandbox home/workdir survives pod rescheduling.
505505
- **Delete**: Calls the Kubernetes API to delete the CRD by name. Returns `false` if already gone (404).
506506
- **Pod IP resolution**: `agent_pod_ip()` fetches the agent pod and reads `status.podIP`.
507507

crates/openshell-bootstrap/src/constants.rs

Lines changed: 85 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,15 +13,100 @@ pub const SERVER_CLIENT_CA_SECRET_NAME: &str = "openshell-server-client-ca";
1313
pub const CLIENT_TLS_SECRET_NAME: &str = "openshell-client-tls";
1414
/// K8s secret holding the SSH handshake HMAC secret (shared by gateway and sandbox pods).
1515
pub const SSH_HANDSHAKE_SECRET_NAME: &str = "openshell-ssh-handshake";
16+
const NODE_NAME_PREFIX: &str = "openshell-";
17+
const NODE_NAME_FALLBACK_SUFFIX: &str = "gateway";
18+
const KUBERNETES_MAX_NAME_LEN: usize = 253;
1619

1720
pub fn container_name(name: &str) -> String {
1821
format!("openshell-cluster-{name}")
1922
}
2023

24+
/// Deterministic k3s node name derived from the gateway name.
25+
///
26+
/// k3s defaults to using the container hostname (= Docker container ID) as
27+
/// the node name. When the container is recreated (e.g. after an image
28+
/// upgrade), the container ID changes, creating a new k3s node. The
29+
/// `clean_stale_nodes` function then deletes PVCs whose backing PVs have
30+
/// node affinity for the old node — wiping the server database and any
31+
/// sandbox persistent volumes.
32+
///
33+
/// By passing a deterministic `--node-name` to k3s, the node identity
34+
/// survives container recreation, and PVCs are never orphaned.
35+
///
36+
/// Gateway names allow Docker-friendly separators and uppercase characters,
37+
/// but Kubernetes node names must be DNS-safe. Normalize the gateway name into
38+
/// a single lowercase RFC 1123 label so previously accepted names such as
39+
/// `prod_us` or `Prod.US` still deploy successfully.
40+
pub fn node_name(name: &str) -> String {
41+
format!("{NODE_NAME_PREFIX}{}", normalize_node_name_suffix(name))
42+
}
43+
44+
fn normalize_node_name_suffix(name: &str) -> String {
45+
let mut normalized = String::with_capacity(name.len());
46+
let mut last_was_separator = false;
47+
48+
for ch in name.chars() {
49+
if ch.is_ascii_alphanumeric() {
50+
normalized.push(ch.to_ascii_lowercase());
51+
last_was_separator = false;
52+
} else if !last_was_separator {
53+
normalized.push('-');
54+
last_was_separator = true;
55+
}
56+
}
57+
58+
let mut normalized = normalized.trim_matches('-').to_string();
59+
if normalized.is_empty() {
60+
normalized.push_str(NODE_NAME_FALLBACK_SUFFIX);
61+
}
62+
63+
let max_suffix_len = KUBERNETES_MAX_NAME_LEN.saturating_sub(NODE_NAME_PREFIX.len());
64+
if normalized.len() > max_suffix_len {
65+
normalized.truncate(max_suffix_len);
66+
normalized.truncate(normalized.trim_end_matches('-').len());
67+
}
68+
69+
if normalized.is_empty() {
70+
normalized.push_str(NODE_NAME_FALLBACK_SUFFIX);
71+
}
72+
73+
normalized
74+
}
75+
2176
pub fn volume_name(name: &str) -> String {
2277
format!("openshell-cluster-{name}")
2378
}
2479

2580
pub fn network_name(name: &str) -> String {
2681
format!("openshell-cluster-{name}")
2782
}
83+
84+
#[cfg(test)]
85+
mod tests {
86+
use super::*;
87+
88+
#[test]
89+
fn node_name_normalizes_uppercase_and_underscores() {
90+
assert_eq!(node_name("Prod_US"), "openshell-prod-us");
91+
}
92+
93+
#[test]
94+
fn node_name_collapses_and_trims_separator_runs() {
95+
assert_eq!(node_name("._Prod..__-Gateway-."), "openshell-prod-gateway");
96+
}
97+
98+
#[test]
99+
fn node_name_falls_back_when_gateway_name_has_no_alphanumerics() {
100+
assert_eq!(node_name("...___---"), "openshell-gateway");
101+
}
102+
103+
#[test]
104+
fn node_name_truncates_to_kubernetes_name_limit() {
105+
let gateway_name = "A".repeat(400);
106+
let node_name = node_name(&gateway_name);
107+
108+
assert!(node_name.len() <= KUBERNETES_MAX_NAME_LEN);
109+
assert!(node_name.starts_with(NODE_NAME_PREFIX));
110+
assert!(node_name.ends_with('a'));
111+
}
112+
}

crates/openshell-bootstrap/src/docker.rs

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
// SPDX-License-Identifier: Apache-2.0
33

44
use crate::RemoteOptions;
5-
use crate::constants::{container_name, network_name, volume_name};
5+
use crate::constants::{container_name, network_name, node_name, volume_name};
66
use crate::image::{self, DEFAULT_IMAGE_REPO_BASE, DEFAULT_REGISTRY, parse_image_ref};
77
use bollard::API_DEFAULT_VERSION;
88
use bollard::Docker;
@@ -684,6 +684,11 @@ pub async fn ensure_container(
684684
format!("REGISTRY_HOST={registry_host}"),
685685
format!("REGISTRY_INSECURE={registry_insecure}"),
686686
format!("IMAGE_REPO_BASE={image_repo_base}"),
687+
// Deterministic k3s node name so the node identity survives container
688+
// recreation (e.g. after an image upgrade). Without this, k3s uses
689+
// the container ID as the hostname/node name, which changes on every
690+
// container recreate and triggers stale-node PVC cleanup.
691+
format!("OPENSHELL_NODE_NAME={}", node_name(name)),
687692
];
688693
if let Some(endpoint) = registry_endpoint {
689694
env_vars.push(format!("REGISTRY_ENDPOINT={endpoint}"));
@@ -753,6 +758,14 @@ pub async fn ensure_container(
753758

754759
let config = ContainerCreateBody {
755760
image: Some(image_ref.to_string()),
761+
// Set the container hostname to the deterministic node name.
762+
// k3s uses the container hostname as its default node name. Without
763+
// this, Docker defaults to the container ID (first 12 hex chars),
764+
// which changes on every container recreation and can cause
765+
// `clean_stale_nodes` to delete the wrong node on resume. The
766+
// hostname persists across container stop/start cycles, ensuring a
767+
// stable node identity.
768+
hostname: Some(node_name(name)),
756769
cmd: Some(cmd),
757770
env,
758771
exposed_ports: Some(exposed_ports),

crates/openshell-bootstrap/src/runtime.rs

Lines changed: 32 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
// SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
// SPDX-License-Identifier: Apache-2.0
33

4-
use crate::constants::{KUBECONFIG_PATH, container_name};
4+
use crate::constants::{KUBECONFIG_PATH, container_name, node_name};
55
use bollard::Docker;
66
use bollard::container::LogOutput;
77
use bollard::exec::CreateExecOptions;
@@ -385,11 +385,19 @@ pub async fn clean_stale_nodes(docker: &Docker, name: &str) -> Result<usize> {
385385
let container_name = container_name(name);
386386
let mut stale_nodes: Vec<String> = Vec::new();
387387

388+
// Determine the current node name. With the deterministic `--node-name`
389+
// entrypoint change the k3s node is `openshell-{gateway}`. However, older
390+
// cluster images (built before that change) still use the container hostname
391+
// (= Docker container ID) as the node name. We must handle both:
392+
//
393+
// 1. If the expected deterministic name appears in the node list, use it.
394+
// 2. Otherwise fall back to the container hostname (old behaviour).
395+
//
396+
// This ensures backward compatibility during upgrades where the bootstrap
397+
// CLI is newer than the cluster image.
398+
let deterministic_node = node_name(name);
399+
388400
for attempt in 1..=MAX_ATTEMPTS {
389-
// List ALL node names and the container's own hostname. Any node that
390-
// is not the current container is stale — we cannot rely on the Ready
391-
// condition because k3s may not have marked the old node NotReady yet
392-
// when this runs shortly after container start.
393401
let (output, exit_code) = exec_capture_with_exit(
394402
docker,
395403
&container_name,
@@ -406,16 +414,27 @@ pub async fn clean_stale_nodes(docker: &Docker, name: &str) -> Result<usize> {
406414
.await?;
407415

408416
if exit_code == 0 {
409-
// Determine the current node name (container hostname).
410-
let (hostname_out, _) =
411-
exec_capture_with_exit(docker, &container_name, vec!["hostname".to_string()])
412-
.await?;
413-
let current_hostname = hostname_out.trim().to_string();
414-
415-
stale_nodes = output
417+
let all_nodes: Vec<&str> = output
416418
.lines()
417419
.map(str::trim)
418-
.filter(|l| !l.is_empty() && *l != current_hostname)
420+
.filter(|l| !l.is_empty())
421+
.collect();
422+
423+
// Pick the current node identity: prefer the deterministic name,
424+
// fall back to the container hostname for older cluster images.
425+
let current_node = if all_nodes.contains(&deterministic_node.as_str()) {
426+
deterministic_node.clone()
427+
} else {
428+
// Older cluster image without --node-name: read hostname.
429+
let (hostname_out, _) =
430+
exec_capture_with_exit(docker, &container_name, vec!["hostname".to_string()])
431+
.await?;
432+
hostname_out.trim().to_string()
433+
};
434+
435+
stale_nodes = all_nodes
436+
.into_iter()
437+
.filter(|n| *n != current_node)
419438
.map(ToString::to_string)
420439
.collect();
421440
break;

0 commit comments

Comments
 (0)