skaffold verify fails immediately without any useful logs #9587

Open
nathanperkins opened this issue Nov 28, 2024 · 5 comments · May be fixed by #9589

nathanperkins commented Nov 28, 2024

Expected behavior

When using skaffold verify with the kubernetesCluster execution mode, if the pod fails immediately, skaffold should either give useful logs in the CLI or leave the Job and Pod on the cluster so they can be inspected.

I'd prefer to see the logs in the CLI, but if that is infeasible, it would be nice to have an option to keep skaffold from deleting the job and pod.

Actual behavior

An immediate skaffold verify failure produces no useful logs, and no Job or Pod is left on the cluster. No logs are found in the GCP Cloud Logging console.

$ skaffold verify -a artifacts.txt
Tags used in verification:
 - <redacted> -> <redacted>:6489f2d-dirty
1 error(s) occurred:
* verify test failed: "<redacted>" running job "<redacted>" errored during run

Information

  • Skaffold version: v2.13.0
  • Operating system: Linux
  • Installed via: skaffold.dev
  • Contents of skaffold.yaml: (unfortunately proprietary)
  • K8s cluster: GKE
idsulik (Contributor) commented Nov 28, 2024

@nathanperkins hi! I added some details in PR #9589; it should show the failure reason and message now.

nathanperkins (Author) commented Nov 28, 2024

@idsulik that's awesome, thanks for following up with a quick improvement. It will definitely help.

I'm not sure pod.status.message will be able to show why the pod is crashing in all cases, though. I can briefly see in the GKE console that the job exited with code 128 before it's deleted, and I'm pretty sure the actual error is in the logs. I could be wrong, though.

I might be able to catch it if I'm quick enough, but what would help most is being able to prevent the Job and Pod from being deleted so I can inspect them freely.

idsulik (Contributor) commented Nov 28, 2024

@nathanperkins, maybe you need TTL-after-finished (https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/#cleanup-for-finished-jobs) to keep the pod?
If you want to save logs into the pod's status, try terminationMessagePolicy: https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/
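
For reference, here is a rough sketch of where those two settings sit on a plain Kubernetes Job (names are placeholders, and this is the upstream batch/v1 Job spec, not the manifest skaffold generates):

apiVersion: batch/v1
kind: Job
metadata:
  name: foo-e2e-test                    # placeholder name
spec:
  ttlSecondsAfterFinished: 600          # keep the finished Job (and its Pod) around for 10 minutes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: foo-e2e-test
          image: foo-e2e-test
          terminationMessagePolicy: FallbackToLogsOnError   # copy the tail of the logs into the container's termination status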

nathanperkins (Author) commented:

Thanks for your responses on this :)

maybe you need this to keep the pod?

Based on the logs, the pod is being force-deleted by skaffold on failure, not by any K8s controller:

DEBU[0323] getting client config for kubeContext: `...`  subtask=-1 task=DevLoop
DEBU[0324] Running command: [kubectl --context ... delete pod --force --grace-period 0 --wait=true --selector job-name=foo-e2e-test]  subtask=-1 task=DevLoop
DEBU[0324] Command output: [pod "foo-e2e-test-2kkhs" force deleted
], stderr: Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.  subtask=-1 task=DevLoop

If you want to save logs into the pod's status, try terminationMessagePolicy

Great idea! In many cases, the last 2K of logs should be enough to determine the failure. But it doesn't seem to be supported in the verify YAML:

verify:
  - name: foo-e2e-test
    container:
      name: foo-e2e-test
      image: foo-e2e-test
      terminationMessagePolicy: FallbackToLogsOnError
    executionMode:
      kubernetesCluster: {}

Fails with:

parsing skaffold config: error parsing skaffold configuration file: unable to parse config: yaml: unmarshal errors:
  line 273: field terminationMessagePolicy not found in type latest.VerifyContainer
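
(For comparison, the field is accepted in a plain core/v1 container spec, e.g. something like the sketch below with placeholder names; skaffold's latest.VerifyContainer just doesn't expose it.)

apiVersion: v1
kind: Pod
metadata:
  name: foo-e2e-test
spec:
  restartPolicy: Never
  containers:
    - name: foo-e2e-test
      image: foo-e2e-test
      terminationMessagePolicy: FallbackToLogsOnError   # valid on a core/v1 Container, but not in skaffold's verify schema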

nathanperkins (Author) commented Dec 2, 2024

I see now why the logs weren't coming through in my case: kubelet was failing to start the container due to an image issue. AFAIK that would have been obvious in the status reason, so the enhancement you made would help a lot.

The terminationMessagePolicy field doesn't seem necessary at the moment, since skaffold verify does correctly show the logs once the container has started.

Still, I think support for disabling deletion of the Job/Pod on failure would further improve the ability to debug issues.
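
In the meantime, one possible stopgap (just a sketch, not something skaffold documents) is to stream the job's pod logs from a second terminal while skaffold verify runs, so the output is captured before the force delete. It still races against pod creation and deletion, so it has to be started as soon as the pod appears:

$ kubectl --context ... get pods --selector job-name=foo-e2e-test --watch
$ kubectl --context ... logs --follow --selector job-name=foo-e2e-test --all-containers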
