@rohanKanojia
Member

@rohanKanojia rohanKanojia commented Oct 8, 2025

What does this PR do?

In #425, the controller.devfile.io/debug-start annotation was added to aid in debugging failed DevWorkspaces (see "Debugging a failing workspace").

This PR extends that annotation: when it is set, any failure in a postStart command makes the container sleep for the duration of the configured progressTimeout, giving developers time to inspect the container state.

  • Added enableDebugStart parameter to poststart methods.
  • Injects trap ... sleep into postStart scripts when debug mode is enabled.
  • Includes support for both timeout-wrapped (postStartTimeout) and non-timeout lifecycle scripts.

This feature improves the debuggability of DevWorkspaces where postStart hooks fail and would otherwise cause container crashes/restarts.
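
The trap-based mechanism described above can be tried locally with a small sketch. It assumes bash is installed (the ERR trap is not part of POSIX sh, so the sketch invokes bash explicitly). The function name run_poststart_demo and the temporary directory are hypothetical, and the sleep duration passed as an argument stands in for the value derived from progressTimeout.

```shell
# Hypothetical demo, not the operator's code: run a failing "postStart"
# script with a debug trap. $1 is the sleep duration in seconds.
run_poststart_demo() {
  sleep_seconds="$1"
  workdir=$(mktemp -d)
  status=0
  # Same shape as the script visible in the ps output: set -e, an ERR trap
  # that sleeps, then the user's command.
  cat > "$workdir/poststart.sh" <<EOF
set -e
trap 'echo "[postStart] failure encountered, sleep for debugging"; sleep $sleep_seconds' ERR
ls idontexist
EOF
  # Redirect stdout and stderr to files, as the real hook does under /tmp.
  bash "$workdir/poststart.sh" 1>"$workdir/stdout.txt" 2>"$workdir/stderr.txt" || status=$?
  cat "$workdir/stdout.txt"       # message printed by the trap
  cat "$workdir/stderr.txt" >&2   # root cause reported by the failing command
  rm -rf "$workdir"
  return "$status"
}
# Usage: run_poststart_demo 5
```

The trap runs before `set -e` terminates the script, so the script stays alive for the sleep duration and still exits with a non-zero status afterwards.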

What issues does this PR fix or reference?

eclipse-che/che#23404

Is it tested? How?

With Changes

  1. Check out the code changes from this PR.
  2. Deploy the DevWorkspace Operator to a Kubernetes/OpenShift cluster: make docker && make install
  3. Create a DevWorkspace that has a failing postStart command:
oc apply -f - <<EOF
apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: failing-poststart-debug-dw
  annotations:
    controller.devfile.io/debug-start: "true"
spec:
  started: true
  template:
    components:
      - name: tools
        container:
          image: quay.io/wto/web-terminal-tooling:next
          sourceMapping: /projects
          command: [ "tail" ]
          args: [ "-f", "/dev/null" ]
    commands:
      - id: failing-command
        exec:
          commandLine: ls idontexist
          component: tools
    events:
      postStart:
        - failing-command
EOF
  4. After creating the DevWorkspace, observe its pod status. It should stay in the ContainerCreating phase:
oc get dw                                                                           
NAME                         DEVWORKSPACE ID             PHASE      INFO
failing-poststart-debug-dw   workspace55bf350cfb754260   Starting   Waiting for workspace deployment
oc get pods                                                                          
NAME                                         READY   STATUS              RESTARTS   AGE
workspace55bf350cfb754260-54749bf7c5-288vt   0/1     ContainerCreating   0          10s
  5. Exec into the pod and read /tmp/poststart-stderr.txt to see the root cause of the failure:
oc get pods                                                                         
NAME                                         READY   STATUS              RESTARTS   AGE
workspace55bf350cfb754260-54749bf7c5-288vt   0/1     ContainerCreating   0          14s
kubectl exec -it workspace55bf350cfb754260-54749bf7c5-288vt -- /bin/bash            
bash-4.4$ cat /tmp/poststart-stderr.txt 
ls: cannot access 'idontexist': No such file or directory
  6. Verify that the sleep process is active in the container:
ps -ax | grep sleep
      2 ?        Ss     0:00 /bin/sh -c { cat << 'EOF' > /tmp/poststart.sh #!/bin/sh set -e trap 'echo "[postStart] failure encountered, sleep for debugging"; sleep 3600' ERR ls idontexist EOF chmod +x /tmp/poststart.sh /tmp/poststart.sh  } 1>/tmp/poststart-stdout.txt 2>/tmp/poststart-stderr.txt 
      7 ?        S      0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 3600
     19 pts/0    S+     0:00 grep sleep
With PostStartTimeout Enabled

PostStart commands are processed slightly differently when the postStartTimeout field is set in DevWorkspaceOperatorConfig. You can verify the above flow after setting it:

oc patch devworkspaceoperatorconfig devworkspace-operator-config -n openshift-operators --type=merge -p '{"config": {"workspace": {"postStartTimeout": "5m"}}}'

## Repeat steps 3-6

Without Changes

On the current main branch, when we create a DevWorkspace with a failing postStart event, the pod immediately goes into a PostStartHookFailed error and then into CrashLoopBackOff. It is not possible to exec into the container to view the failure:

# Create DevWorkspace with failing poststart
oc apply -f - <<EOF
apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: failing-poststart-debug-dw
  annotations:
    controller.devfile.io/debug-start: "true"
spec:
  started: true
  template:
    components:
      - name: tools
        container:
          image: quay.io/wto/web-terminal-tooling:next
          sourceMapping: /projects
          command: [ "tail" ]
          args: [ "-f", "/dev/null" ]
    commands:
      - id: failing-command
        exec:
          commandLine: ls idontexist
          component: tools
    events:
      postStart:
        - failing-command
EOF

oc get dw                                                                            
NAME                         DEVWORKSPACE ID             PHASE     INFO
failing-poststart-debug-dw   workspace7ad9a94285b94f7c   Failing   Error creating DevWorkspace deployment: Container tools has state [postStart hook] Commands failed (Kubelet reported exit code 2)

oc get pods                                                                          
NAME                                         READY   STATUS             RESTARTS      AGE
workspace7ad9a94285b94f7c-579896cc48-wmtrj   0/1     CrashLoopBackOff   1 (20s ago)   50s
kubectl exec -it workspace7ad9a94285b94f7c-579896cc48-wmtrj -- /bin/bash             
error: unable to upgrade connection: container not found ("tools")

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

openshift-ci bot commented Oct 8, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Oct 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rohanKanojia
Once this PR has been reviewed and has the lgtm label, please assign dkwon17 for approval. For more information see the Code Review Process.


@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch 3 times, most recently from 5e1a317 to bfa0ab7 Compare October 8, 2025 18:44
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from bfa0ab7 to ea21eb5 Compare October 9, 2025 03:50
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from ea21eb5 to 9559169 Compare October 9, 2025 09:22
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 542c5ff to 9853a13 Compare October 16, 2025 09:14
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 9853a13 to 605efe4 Compare October 16, 2025 11:52
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 605efe4 to ff5a0d9 Compare October 16, 2025 15:37
@rohanKanojia
Member Author

/ok-to-test

@rohanKanojia rohanKanojia marked this pull request as ready for review October 16, 2025 15:54
@tolusha
Contributor

tolusha commented Oct 23, 2025

For some reason, my workspace is running (tested on OpenShift):

oc get dw    -A
NAMESPACE   NAME                         DEVWORKSPACE ID             PHASE     INFO
test        failing-poststart-debug-dw   workspaced0882b8ed1fc4c69   Running   Workspace is running


postStartDebugTrapSleepDuration := ""
if workspace.Annotations[constants.DevWorkspaceDebugStartAnnotation] == "true" {
	postStartDebugTrapSleepDuration = workspace.Config.Workspace.ProgressTimeout
Contributor


To be honest, I don't like using ProgressTimeout for this purpose.
On the other hand, I don't have an alternative other than some constant.

Member Author


Actually, I use ProgressTimeout to be consistent with the behavior of the Debug annotation when it fails for the main component.

We do not scale down the failing workspace until the failing timeout has elapsed:

// If debug annotation is present, leave the deployment in place to let users
// view logs.
if workspace.Annotations[constants.DevWorkspaceDebugStartAnnotation] == "true" {
	if isTimeout, err := checkForFailingTimeout(workspace); err != nil {

Inside checkForFailingTimeout, we parse ProgressTimeout:

timeout, err := time.ParseDuration(workspace.Config.Workspace.ProgressTimeout)

#!/bin/sh
%s
EOF
chmod +x /tmp/poststart.sh
Contributor

@tolusha tolusha Oct 23, 2025


@rohanKanojia
Were you able to test this snippet?
I am not sure if chmod +x will work

Member Author


I really apologize for my mistake 🙏. This seems to be a leftover from previous attempts.

I'll remove it.

@rohanKanojia
Member Author

For some reason, my workspace is running (tested on OpenShift)

@tolusha: Could you please share which OCP version you were using? I have tested it on CRC 2.53.0 with OpenShift 4.19.3.

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from ff5a0d9 to 67661ea Compare October 25, 2025 10:45
@rohanKanojia
Member Author

@tolusha: I've recorded these videos on OpenShift 4.20 (via clusterbot):

Scenario 1: No PostStart Timeout Configured

dwo-debug-poststart-normal-scenario.mp4

Scenario 2: PostStart Timeout Configured

dwo-debug-poststart-poststart-timeout-configured.mp4

@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 67661ea to 99ad59e Compare October 25, 2025 11:43
@tolusha
Contributor

tolusha commented Oct 27, 2025

There is a corner case.
When the postStart script already sets its own trap, the added one is ignored.
I think we can keep it as is.
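
The corner case can be reproduced outside the operator with a short sketch. It assumes bash is installed; the function name run_trap_override_demo and the two echo messages are hypothetical stand-ins for the injected debug trap and a trap defined by the user's own postStart command.

```shell
# Hypothetical demo: a later ERR trap replaces an earlier one, so the
# earlier (injected) trap never runs.
run_trap_override_demo() {
  bash -c '
    set -e
    trap "echo injected-debug-trap" ERR   # stand-in for the injected trap
    trap "echo user-trap" ERR             # trap set by the user script
    false                                 # failing command fires the ERR trap
  ' || true
}
# Usage: run_trap_override_demo
```

Only the most recently registered ERR handler fires, which is why the debug sleep is skipped for scripts that install their own trap.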

…rapping errors

Add an optional debug mechanism for postStart lifecycle hooks. When enabled via the
`controller.devfile.io/debug-start: "true"` annotation, any failure in a postStart command makes the container sleep for the duration of the configured progressTimeout, allowing developers time to inspect the container state.

- Added `enableDebugStart` parameter to poststart methods.
- Injects `trap ... sleep` into postStart scripts when debug mode is enabled.
- Includes support for both timeout-wrapped (`postStartTimeout`) and non-timeout lifecycle scripts.

This feature improves the debuggability of DevWorkspaces where postStart hooks fail and would otherwise cause container crashes/restarts.

Signed-off-by: Rohan Kumar <[email protected]>
@rohanKanojia rohanKanojia force-pushed the pr/debug-poststart-via-trap branch from 99ad59e to 58c9221 Compare October 29, 2025 04:27