kepler node_package does not equal total of kepler_process_package #1837

sthaha · 2024-11-06T02:02:28Z

Steps to reproduce on bare metal

1. Deploy Kepler.
2. curl the /metrics endpoint.
3. grep for node_package_joules.
4. grep for kepler_process_package_joules.
5. Sum the process values.
6. Check whether the total differs from node_package.

Expected: there shouldn't be any significant difference
Actual: The difference is quite large and grows over time
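
The manual steps above could be scripted roughly as follows. This is only a sketch: it assumes the metrics text has already been fetched (for example with curl) and that the series are named kepler_node_package_joules_total and kepler_process_package_joules_total, as in the Prometheus query below. The helper names `sum_metric` and `package_energy_gap` are made up for illustration and are not part of Kepler.

```python
import re

# One sample line of the Prometheus text format:
#   name{labels} value [timestamp]
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$'
)


def sum_metric(metrics_text, name):
    """Sum the values of every series whose metric name equals `name`."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


def package_energy_gap(metrics_text):
    """Node package joules minus the sum of per-process package joules."""
    node = sum_metric(metrics_text, "kepler_node_package_joules_total")
    procs = sum_metric(metrics_text, "kepler_process_package_joules_total")
    return node - procs
```

Running `package_energy_gap` on successive scrapes would show whether the gap grows over time.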

Using Prometheus

kepler_node_package_joules_total{job="metal"} - on() sum(kepler_process_package_joules_total{job="metal"})


marvin-steinke · 2024-11-06T08:54:36Z

Do you think this is related to #1833?

sthaha · 2025-02-20T09:30:56Z

@marvin-steinke, I don't think this is related. As for this bug, it turns out to be expected behaviour from Kepler.

The explanation is that the kepler_node_package_joules_total counter tracks the joules accumulated for as long as Kepler has been running, while kepler_process_package_joules_total only tracks running processes (not terminated ones). It is therefore expected that kepler_node_package_joules_total > sum(kepler_process_package_joules_total).
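
A toy model may make the divergence easier to see. It is not Kepler's attribution logic: the function `simulate`, the two process names, and the equal split of energy per tick are all assumptions chosen purely for illustration.

```python
def simulate(ticks, joules_per_tick, exit_at):
    """Return (node_total, sum_of_live_process_totals) after `ticks` steps.

    Two hypothetical processes, "a" and "b", share each tick's energy
    equally. Process "b" terminates at tick `exit_at`, and its counter
    disappears from the per-process view, as a terminated process would.
    """
    node_total = 0.0
    process_totals = {"a": 0.0, "b": 0.0}
    for tick in range(ticks):
        if tick == exit_at:
            # A terminated process is no longer exported.
            process_totals.pop("b", None)
        share = joules_per_tick / len(process_totals)
        for name in process_totals:
            process_totals[name] += share
        # The node counter never forgets energy already counted.
        node_total += joules_per_tick
    return node_total, sum(process_totals.values())


node, procs = simulate(ticks=10, joules_per_tick=4.0, exit_at=5)
gap = node - procs  # energy attributed to "b" before it exited
```

In this model the gap equals the joules that had been attributed to the terminated process, and it can only grow as more processes exit.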

So the right test is whether sum(rate(kepler_node_package_joules_total[30s])) == sum(rate(kepler_process_package_joules_total[30s])), i.e. whether the node's power in watts equals the watts allocated to processes. My tests show a round-off error, which can certainly be minimised.
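
The rate-based check can also be approximated outside Prometheus from two scrapes taken a known number of seconds apart. The sketch below is a simplification of what rate() does (no extrapolation, no counter-reset handling); the `watts` helper and the sample values are hypothetical.

```python
import math


def watts(before, after, seconds):
    """Sum of per-series rates (joules/second) between two scrapes.

    `before` and `after` map a series key (e.g. a pid or package id) to
    its counter value. Series missing from either scrape are skipped,
    so terminated or newly started processes do not contribute.
    """
    return sum(
        (after[key] - before[key]) / seconds
        for key in after
        if key in before
    )


# Made-up counter values, 30 seconds apart.
node_before, node_after = {"pkg0": 100.0}, {"pkg0": 160.0}
proc_before = {"1": 40.0, "2": 30.0, "3": 10.0}
proc_after = {"1": 70.0, "2": 60.0}  # process "3" has terminated

node_watts = watts(node_before, node_after, 30)
proc_watts = watts(proc_before, proc_after, 30)

# Compare with a tolerance to absorb round-off error.
matches = math.isclose(node_watts, proc_watts, rel_tol=0.01)
```

Comparing with a relative tolerance rather than strict equality leaves room for the round-off error mentioned above.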

I see that sometimes (when there is a spike in power use) Kepler fails to allocate the power usage to all running processes correctly, as shown in the screenshot below.

The red line is rate(kepler_node_package_joules_total) and the yellow line is sum(rate(kepler_process_package_joules_total)). These lines are supposed to coincide, but they don't. In most cases the yellow line tracks the red one pretty well. I need to investigate further why this happens.

@rootfs any thoughts?

sthaha self-assigned this Nov 6, 2024

sthaha linked a pull request Feb 20, 2025 that will close this issue

WIP: Reduce node proc diff #1927

Draft
