kepler node_package does not equal total of kepler_process_package #1837

Open
sthaha opened this issue Nov 6, 2024 · 2 comments · May be fixed by #1927
@sthaha
Collaborator

sthaha commented Nov 6, 2024

Steps to reproduce on a bare-metal host

  • deploy kepler
  • curl /metrics
  • grep for node_package_joules
  • grep for kepler_process_package_joules
  • sum the per-process values
  • check whether the total differs from node_package

Expected: there shouldn't be any significant difference
Actual: the difference is quite large and grows over time
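
The manual steps above can be sketched as a small script. This is a minimal sketch, assuming the /metrics output has already been saved as text; the label names (package, pid) and the sample values are made up for illustration and are not taken from a real kepler scrape.

```python
import re

# Matches one sample line of the Prometheus text format:
# metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def sum_metric(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` found in a /metrics text dump."""
    total = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


# Hypothetical scrape output; labels and values are illustrative only.
SAMPLE = """\
# TYPE kepler_node_package_joules_total counter
kepler_node_package_joules_total{package="0"} 1000.0
# TYPE kepler_process_package_joules_total counter
kepler_process_package_joules_total{pid="1"} 300.0
kepler_process_package_joules_total{pid="2"} 450.5
"""

node_total = sum_metric(SAMPLE, "kepler_node_package_joules_total")
process_total = sum_metric(SAMPLE, "kepler_process_package_joules_total")
gap = node_total - process_total
print(f"node={node_total} processes={process_total} gap={gap}")
```

Matching on the exact metric name (rather than a substring, as grep would) avoids accidentally summing other metrics whose names merely contain the same text.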

Using Prometheus

  • kepler_node_package_joules_total{job="metal"} - on() sum(kepler_process_package_joules_total{job="metal"})

image

@sthaha sthaha self-assigned this Nov 6, 2024
@marvin-steinke

Do you think this is related to #1833?

@sthaha
Collaborator Author

sthaha commented Feb 20, 2025

@marvin-steinke, I don't think this is related to #1833. As for this bug, it turns out to be expected behaviour from kepler.

The explanation is that the kepler_node_package_joules_total counter tracks the joules consumed since kepler started, while kepler_process_package_joules_total only tracks running processes (and not terminated ones). Thus kepler_node_package_joules_total > sum(kepler_process_package_joules_total) is expected.
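
To illustrate why the gap grows, here is a toy model of the two counters. It is only a sketch of the behaviour described above, not kepler's actual implementation; the simulate function, the process names, and the per-tick joule values are all hypothetical.

```python
def simulate(ticks):
    """Toy model: each tick maps running process -> joules used in that tick.

    The node counter accumulates everything since start, while per-process
    counters are dropped as soon as the process is no longer running.
    """
    node_total = 0.0
    process_totals = {}
    for running in ticks:
        # Forget counters of processes that have terminated.
        process_totals = {
            pid: joules for pid, joules in process_totals.items() if pid in running
        }
        for pid, joules in running.items():
            process_totals[pid] = process_totals.get(pid, 0.0) + joules
            node_total += joules
    return node_total, sum(process_totals.values())


# Process "b" runs for two ticks and then exits.
ticks = [
    {"a": 5.0, "b": 3.0},
    {"a": 5.0, "b": 3.0},
    {"a": 5.0},
]
node_total, process_sum = simulate(ticks)
print(node_total, process_sum)  # the difference is the energy used by "b"
```

In this model the node counter keeps the 6 joules attributed to the terminated process, so the difference can only grow as more processes exit.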

So the right test is whether sum(rate(kepler_node_package_joules_total[30s])) == sum(rate(kepler_process_package_joules_total[30s])), i.e. whether the node's power in watts equals the watts allocated to processes. My tests show a round-off error, which can certainly be minimised.
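
The same comparison can be sketched outside Prometheus from two scrapes taken a known interval apart. This is a simplified stand-in for rate(): it ignores counter resets and extrapolation, and the tolerance_watts value and sample numbers are assumptions chosen for illustration. The process samples must be summed over the same set of processes in both scrapes.

```python
def watts(first_joules, second_joules, interval_seconds):
    """Average power between two counter samples, in joules per second."""
    return (second_joules - first_joules) / interval_seconds


def power_matches(node_samples, process_samples, interval_seconds,
                  tolerance_watts=0.5):
    """Compare node power with the power attributed to processes.

    Each *_samples argument is a (first, second) pair of counter values;
    tolerance_watts is a hypothetical allowance for round-off error.
    """
    node_watts = watts(*node_samples, interval_seconds)
    process_watts = watts(*process_samples, interval_seconds)
    return abs(node_watts - process_watts) <= tolerance_watts


# Illustrative values: the absolute counters differ by hundreds of joules,
# yet the power over the 30 s window agrees to within the tolerance.
ok = power_matches((1000.0, 1300.0), (750.0, 1047.0), 30.0)
print(ok)
```

Comparing rates rather than absolute counters removes the contribution of terminated processes, which is why the absolute gap is not a useful signal on its own.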

I see that sometimes (when there is a spike in power use) kepler fails to allocate the power usage to all running processes correctly, as shown in the screenshot below.

Image

The red line is rate(kepler_node_package_joules_total) and the yellow line is sum(rate(kepler_process_package_joules_total)). These lines are supposed to be the same but they aren't, although in most cases they track each other pretty well. I need to investigate further why this is the case.

Image

@rootfs any thoughts?

@sthaha sthaha linked a pull request Feb 20, 2025 that will close this issue