
Setting a devfile value for cpuLimit should not override the default value for cpuRequest #23245

Open
cgruver opened this issue Nov 12, 2024 · 4 comments
Labels
area/devworkspace-operator kind/bug Outline of a bug - must adhere to the bug report template. severity/P2 Has a minor but important impact to the usage or development of the system. status/analyzing An issue has been proposed and it is currently being analyzed for effort and implementation approach

Comments


cgruver commented Nov 12, 2024

Describe the bug

As a developer, if I configure my devfile with cpuLimit, I expect that the default value for cpuRequest will be used, since I did not modify it.

However, when my workspace is started, the resources.requests.cpu value for the container is set equal to resources.limits.cpu. This behavior can result in under-utilized cluster nodes that cannot have any more Pods scheduled.

Che version

7.90

Steps to reproduce

  1. Create a workspace from: https://github.com/cgruver/devspaces-tiny-workspace.git

  2. Modify the devfile.yaml by adding cpuLimit & memoryLimit

schemaVersion: 2.2.0
attributes:
  controller.devfile.io/storage-type: per-workspace
metadata:
  name: tiny-workspace
components:
- name: dev-tools
  container:
    image: quay.io/cgruver0/che/tiny-workspace
    cpuLimit: 1000m
    memoryLimit: 5Gi
  3. Restart the workspace from the local devfile.yaml

  4. Observe the Pod of the workspace:

  containers:
    - resources:
        limits:
          cpu: 1500m
          memory: 6Gi
        requests:
          cpu: 1500m
          memory: 320Mi

Expected behavior

I expect the default value for resources.requests.cpu to be preserved when only cpuLimit is set in the devfile.
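Concretely, the expected pod resources would keep the request at a platform default while honoring the devfile limit. In this sketch, the 100m CPU request is a hypothetical default value, not an existing operator setting:

```yaml
# Hypothetical expected result: limits come from the devfile,
# requests stay at platform defaults (values illustrative).
containers:
  - resources:
      limits:
        cpu: 1000m     # cpuLimit from the devfile
        memory: 5Gi    # memoryLimit from the devfile
      requests:
        cpu: 100m      # assumed platform default, not overridden by cpuLimit
        memory: 320Mi  # default memory request, as observed above
```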

Runtime

OpenShift

Screenshots

No response

Installation method

OperatorHub

Environment

macOS

Eclipse Che Logs

No response

Additional context

No response

@cgruver cgruver added the kind/bug Outline of a bug - must adhere to the bug report template. label Nov 12, 2024
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Nov 12, 2024
@dkwon17 dkwon17 added area/devworkspace-operator severity/P1 Has a major impact to usage or development of the system. and removed status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Nov 12, 2024

ibuziuk commented Nov 13, 2024

@cgruver I might be missing something, but why don't you specify requests at the devfile level if you want them to be applied?

  • I believe if namespace has the LimitRanges set with default requests those would be applied
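For reference, the kind of namespace-level LimitRange mentioned above might look like this; the namespace name and values are illustrative:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: workspace-defaults
  namespace: user-che    # illustrative workspace namespace
spec:
  limits:
    - type: Container
      defaultRequest:    # applied when a container omits requests
        cpu: 100m
        memory: 256Mi
      default:           # applied when a container omits limits
        cpu: "1"
        memory: 2Gi
```

Note that LimitRange defaults only apply to fields the pod spec leaves unset, so they would not override a request the operator has already copied from the limit.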

The current behaviour is expected, even though the limit = request strategy is a nuanced approach.

@ibuziuk ibuziuk added severity/P2 Has a minor but important impact to the usage or development of the system. status/analyzing An issue has been proposed and it is currently being analyzed for effort and implementation approach and removed severity/P2 Has a minor but important impact to the usage or development of the system. severity/P1 Has a major impact to usage or development of the system. labels Nov 13, 2024

cgruver commented Nov 13, 2024

Yeah, I understand the approach. And I am also a HUGE advocate for developer autonomy. But in this particular case, I have a different opinion derived from hard experience.

In an enterprise with a couple thousand developers writing devfiles, it is very common for them to not grasp the consequences of their configuration.

I have seen too many underutilized but over-scaled OpenShift clusters because developers set their resources.requests.cpu to an unreasonably high value.

If a platform engineer can set reasonable default values for workspace requests and limits, then more experienced developers can raise limit values as needed without causing the Kubernetes scheduler to consider a node full when it is not.

IMO, it's a different paradigm from production workloads which may need a certain guarantee. Developer workloads are much more dynamic, and spend a lot more time idle.
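A platform-wide default along these lines could perhaps be expressed through the DevWorkspace Operator's configuration. The sketch below is an assumption about that config surface: the defaultContainerResources field and its placement are guesses for illustration, not verified settings.

```yaml
# Sketch only: field name and placement are assumed, not verified
# against the DevWorkspace Operator's actual config schema.
apiVersion: controller.devfile.io/v1alpha1
kind: DevWorkspaceOperatorConfig
metadata:
  name: devworkspace-operator-config
  namespace: openshift-operators
config:
  workspace:
    defaultContainerResources:   # assumed knob for platform defaults
      requests:
        cpu: 100m
        memory: 256Mi
```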


ibuziuk commented Nov 13, 2024

Well, regarding CPU limits in general, I advocate not setting them at all: if you set them, your workloads are throttled by definition.

CPU limits for soft-tenancy pods are probably not going to be helpful unless you're approaching very dense setups (> 10 pods per core); otherwise, you will waste more CPU on throttling than you save. CPU limits definitely increase tail latencies for most non-predictable workloads (almost all request-driven use cases) in a way that results in a worse overall application environment for most users most of the time (because of how limits are sliced). At lower pod densities, you are almost certainly trading a false sense of security for a worse quality of service.

CPU limits are most useful when dealing with bad actors on your own platform, and even then there are far more effective ways of dealing with bad actors, like detection and account blocking. I would advise not putting limits on that and instead putting more effort into load monitoring, capacity planning, and reactive scheduling. Unless limits are set so high as to bypass all protection for normal use, you're doing the service's users harm.

it is very common for them to not grasp the consequences of their configuration.

Limits and requests for CPU are a very subtle, contended topic: no matter what config engineers put in place, with or without CPU limits, they need to understand exactly what they are doing.


cgruver commented Nov 13, 2024

I totally agree with you with respect to CPU limits... except for developer CDE workloads. The one reason I would advocate for limits is the noisy-neighbor effect.

My opinion in this matter changed because of experience with multi-threaded developer workloads that inadvertently saturated the CPU of a node, impacting the other workspaces on the node. Unlike tested application deployments, developer CDE workloads are very unpredictable.

It's definitely a nuanced topic.

I've seen it happen enough to believe that there's a reasonable approach to be considered.

This new capability may be sufficient - #23176
