@marvin-steinke, I don't think this is related, but regarding this bug it turns out this is expected behaviour from Kepler.
The explanation is that the kepler_node_package_joules_total counter keeps track of the joules accumulated since Kepler started running, while kepler_process_package_joules_total only tracks currently running processes (and not terminated ones). Thus it is expected that kepler_node_package_joules_total > sum(kepler_process_package_joules_total).
So the right test is whether sum(rate(kepler_node_package_joules_total[30s])) == sum(rate(kepler_process_package_joules_total[30s])), i.e. whether the node's power in watts equals the watts allocated to processes. My tests show a round-off error which can certainly be minimised.
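For reference, a minimal PromQL sketch of that check (the job="metal" selector and the 30s window are taken from the query later in this issue and are only illustrative):

```
# Node-level package power (W), averaged over a 30s window
sum(rate(kepler_node_package_joules_total{job="metal"}[30s]))

# Package power attributed to all currently running processes (W)
sum(rate(kepler_process_package_joules_total{job="metal"}[30s]))
```

If attribution is correct, the two values should match up to round-off error.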
I see that sometimes (when there is a spike in power use), Kepler fails to allocate the power usage to all running processes correctly, as shown in the screenshot below.
The red line is rate(kepler_node_package_joules_total) and the yellow line is sum(rate(kepler_process_package_joules_total)). These lines are supposed to be the same, but they aren't; in most cases, though, they track pretty well. I need to investigate further why this happens.
Steps to reproduce on bare metal
[Screenshot: node_package_joules]
Expected: there shouldn't be any significant difference.
Actual: the difference is quite large and grows over time.
Using Prometheus:
kepler_node_package_joules_total{job="metal"} - on() sum(kepler_process_package_joules_total{job="metal"})
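Since the raw counters are expected to diverge as processes terminate, a rate-based variant of the same difference (again assuming job="metal" and a 30s window, both illustrative) is the more meaningful signal to graph; it should hover near zero apart from round-off error and the attribution spikes mentioned above:

```
# Instantaneous power gap (W): node-level rate minus sum of per-process rates
sum(rate(kepler_node_package_joules_total{job="metal"}[30s]))
  - sum(rate(kepler_process_package_joules_total{job="metal"}[30s]))
```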