
[image builder] count failures and differentiate between system and user #15573

Open
kylos101 opened this issue Jan 4, 2023 · 11 comments
Labels
component: image-builder · meta: never-stale (This issue can never become stale) · team: workspace (Issue belongs to the Workspace team)

Comments

@kylos101
Contributor

kylos101 commented Jan 4, 2023

Is your feature request related to a problem? Please describe

Image builds fail, but we don't know how many of those failures are caused by the system versus by user input.

Describe the behaviour you'd like

  • when the system fails, count it as a system failure, either as a label on an existing metric and/or via one or more new metrics. The intent is to be able to tell how many image builds fail due to system failures overall.
  • use the build output (which is on stderr) to identify user errors
    • create a set of parsing rules for known user errors. E.g. a RUN command in the Dockerfile exiting with an error. This would contain a log with >>> RUN:
      [screenshot of the build log]

    • we can incrementally add to this list of parsing rules whenever we identify new user errors. For this, it would be useful for the rule list to be specified in config, so it becomes easy to update.

  • an image build could fail if we cannot authenticate with a private registry to pull the private base image, this should be considered a user failure (introduced in Support private registries #8550)
  • there are likely other paths that require instrumentation, ☝️ are just initial thoughts

Additional context

Related: #15572

@kylos101
Contributor Author

kylos101 commented Jan 4, 2023

@sagor999 for 👀 and 🧑‍🔧

@akosyakov for 👀 and 💭

@kylos101 kylos101 moved this to Breakdown in 🌌 Workspace Team Jan 4, 2023
@sagor999
Contributor

sagor999 commented Jan 6, 2023

This might not be easy to distinguish.
We do the image build here:


So we could potentially say that if that command returns an error, it is a user error.
But the image build can fail for non-user-related reasons as well: upstream Docker Hub issues, connectivity issues, registry-facade issues, etc. Any of these might cause the image build to fail here.

@kylos101
Contributor Author

kylos101 commented Jan 9, 2023

Hmm, I see what you mean: the push is coupled to how we configure buildctl to run. We could output to a temporary tar file, but would then need something else to upload it to our registry.

Our config:

"--output=type=image,name=" + target + ",push=true,oci-mediatypes=true",

We could output build to a tar file like in this answer:
https://stackoverflow.com/a/72001079

And then maybe add a bob push command, and use it as part of bob proxy?

The benefit being, we could instrument the build separately from the push. If we make it to building the Dockerfile, it's safer to say a failure is user related (maybe they fat-fingered the Dockerfile or the related dockerconfigjson on the Gitpod Project). wdyt?

@sagor999
Contributor

@kylos101 interesting option. But wouldn't this make this step quite a bit more complicated?
Especially since buildkit does not seem to have a standalone push command, which means we would need to do the actual push via some other tool? 🤔
Is the benefit worth it, given the future maintenance cost of that additional tool? 🤔

@kylos101
Contributor Author

@sagor999 👋 , let me know what you think? Perhaps there's another way to differentiate between types of failures?

But wouldn't this make this step quite a bit more complicated?

Indeed. Aside from decoupling push from build to better understand image build failure modes through instrumentation, how else can we differentiate between user and system failures?

Especially since buildkit does not seem to have a standalone push command, which means we would need to do the actual push via some other tool?

We use skopeo now for workspace-images to push to GAR and Docker Hub, we could try using it here, too. In this case, it would need to support GCR and ECR.

Is the benefit worth it, given the future maintenance cost of that additional tool? 🤔

We would not have to maintain skopeo, I think.

I wrote:

And then maybe add a bob push command, and use as part of bob proxy?

Which in hindsight is overkill. Instead, we could add a new command that runs after buildctl produces the tar file and does the upload with skopeo.

@kylos101 kylos101 added the team: workspace Issue belongs to the Workspace team label Jan 10, 2023
@kylos101
Contributor Author

@WVerlaek wdyt of separating build and push to help us differentiate between user and system failures? We would have to check whether skopeo can push a tar file built by buildctl, and then see if the resulting image can be pulled and used.

Failure modes:

  • image-builder-mk3 fails creating the image-build pod -> system
  • image-builder pod fails building -> depends
    • pull fails because ipfs or registry-facade fail - system
    • pull fails because the credentials do not work - user (like for a private registry)
  • image-builder pod fails pushing -> system

If you're unsure too, we should reach out to @aledbf and @csweichel for ideas. 😄

@WVerlaek
Member

Had a chat with @kylos101 on this; I would also favour not splitting out a separate push step. It would still be hard to distinguish build-step errors, which can be either user or system errors.

We discussed an alternative solution: we could write parsing rules for known user errors on stderr (e.g. a RUN command failing). We would probably be able to catch the most common user errors with a small set of parsing rules, and can extend these rules over time as we discover new user failure modes. This would give us confidence that we catch all system build errors. What do you think @sagor999?

@sagor999
Contributor

Yeah, I think that might be an option. Especially if the filter is part of a config file, so that we don't need to touch the code to make changes.
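Such a rule list could live in a config file shaped roughly like this; the field names are illustrative, not an actual image-builder configuration schema:

```yaml
# Hypothetical shape for a config-driven rule list.
userErrorRules:
  - name: run-command-failed
    pattern: '>>> RUN:'
  - name: private-registry-auth
    pattern: 'unauthorized: authentication required'
```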

@WVerlaek
Member

Updated the description. cc @kylos101

@kylos101
Contributor Author

Woot! Love it. Moving to Scheduled.

Note: I think it might be hard to intercept a failed docker pull from a private registry as a user failure? A good thing to test as part of the implementation. Maybe it'll fail on the FROM directive, in which case... add it to the list. 😄

@kylos101 kylos101 moved this from Breakdown to Scheduled in 🌌 Workspace Team Jan 13, 2023
@kylos101 kylos101 removed the status in 🌌 Workspace Team Feb 2, 2023
@stale

stale bot commented May 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label May 9, 2023
@kylos101 kylos101 added meta: never-stale This issue can never become stale and removed meta: stale This issue/PR is stale and will be closed soon labels May 11, 2023