Troubleshooting Windows Exporter on EKS: Metrics Missing for Exporter Container and Callback/Connection Errors in Logs #1861

Open
paoloyx opened this issue Jan 28, 2025 · 5 comments

Comments

@paoloyx

paoloyx commented Jan 28, 2025

Problem Statement

I apologize in advance if this request seems confused; I'll do my best to explain the problem I'm facing. There are actually two "problems"; the quotation marks are deliberate, as this could very well be a misinterpretation on my part of what the exporter is doing, or a misconfiguration on my side. On top of that, I'm by no means an expert on Windows operating systems.

That said, I've got a windows exporter installation on an EKS cluster. There are two things that I do not understand:

  • I can correctly get metrics from all running containers except the windows exporter itself, for example by using the windows_container_cpu_usage_seconds_total metric. Is that expected? Is it due to the fact that the exporter is running as a Windows process?

  • I've got a lot of unclear errors in the logs, and I do not know what they mean:

[...]
windows-exporter time="2025-01-28T16:09:56Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=9
windows-exporter ts=2025-01-28T16:09:57.500Z caller=stdlib.go:105 level=error caller=http.go:192 msg="error encoding and sending metric family: write tcp 172.20.71.41:9182->172.20.71.41:50570: wsasend: An established connection was aborted by the software in your host machine."
windows-exporter ts=2025-01-28T16:09:58.449Z caller=stdlib.go:105 level=error caller=http.go:192 msg="error encoding and sending metric family: write tcp 172.20.71.41:9182->172.20.71.41:50571: wsasend: An established connection was aborted by the software in your host machine."
windows-exporter time="2025-01-28T16:10:00Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=22
windows-exporter time="2025-01-28T16:10:00Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=23
windows-exporter time="2025-01-28T16:10:00Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=24
windows-exporter time="2025-01-28T16:10:04Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=37
windows-exporter time="2025-01-28T16:10:04Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=40
windows-exporter time="2025-01-28T16:10:04Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=39
windows-exporter ts=2025-01-28T16:10:07.582Z caller=stdlib.go:105 level=error caller=http.go:192 msg="error encoding and sending metric family: write tcp 172.20.71.41:9182->172.20.71.41:50577: wsasend: An established connection was aborted by the software in your host machine."
windows-exporter ts=2025-01-28T16:10:12.069Z caller=prometheus.go:168 level=warn msg="Collection timed out, still waiting for [service]"
windows-exporter ts=2025-01-28T16:10:13.484Z caller=prometheus.go:168 level=warn msg="Collection timed out, still waiting for [service]"
windows-exporter ts=2025-01-28T16:10:13.485Z caller=stdlib.go:105 level=error caller=http.go:192 msg="error encoding and sending metric family: write tcp 172.20.71.41:9182->172.20.71.41:50578: wsasend: An established connection was aborted by the software in your host machine."
[...]

Can anyone help? Unfortunately, I don't clearly understand what's going on.
Upgrading to a newer version of the exporter is not straightforward for us, but I could do that if it's the best way forward.
Thanks

Environment

  • windows_exporter Version: 0.26.1
  • Windows Server Version: Windows Server 2019
@jkroepke
Member

Hi,

I can correctly get metrics from all running containers except the windows exporter itself, for example by using the windows_container_cpu_usage_seconds_total metric. Is that expected? Is it due to the fact that the exporter is running as a Windows process?

Yes, correct! windows_exporter runs as a HostProcess container so that it can access the host's context. HostProcess means the process is simply started as a normal process on the host, like the service manager would do. As a consequence, the Host Compute System (HCS) does not offer metrics for that kind of process. This can be mitigated by using the process collector.

windows-exporter ts=2025-01-28T16:10:07.582Z caller=stdlib.go:105 level=error caller=http.go:192 msg="error encoding and sending metric family: write tcp 172.20.71.41:9182->172.20.71.41:50577: wsasend: An established connection was aborted by the software in your host machine."

It seems that the request takes longer than the configured Prometheus scrape timeout. Prometheus aborts the request while windows_exporter is still trying to write data on the connection.
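One rough way to check this is to time a single scrape by hand and compare it with the scrape timeout. The following is only a sketch: the function names are made up for illustration, and the 10 second value is Prometheus' default scrape_timeout, which may not be what your setup uses.

```python
import time
import urllib.request


def time_scrape(url, socket_timeout_s=30.0):
    """Fetch the metrics page once and return (elapsed_seconds, body_size_bytes).

    socket_timeout_s bounds each blocking socket operation, not the whole
    request, so keep it well above the scrape timeout you want to compare with.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=socket_timeout_s) as response:
        body = response.read()
    return time.monotonic() - start, len(body)


def would_time_out(elapsed_s, scrape_timeout_s=10.0):
    """True if a scrape that took elapsed_s would hit the given scrape timeout.

    The 10 s default is an assumption (Prometheus' default scrape_timeout).
    """
    return elapsed_s >= scrape_timeout_s
```

For example, call `time_scrape("http://<node-ip>:9182/metrics")` a few times from a machine that can reach the node (9182 is the port seen in the logs above) and pass the elapsed time to `would_time_out`. If the result is regularly true, the aborted connections are most likely just Prometheus giving up on a slow scrape.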

windows-exporter ts=2025-01-28T16:10:13.484Z caller=prometheus.go:168 level=warn msg="Collection timed out, still waiting for [service]"

The service collector is known to be slow. In 0.26, this can be mitigated by using collector.service.use-api or collector.service.v2. Since 0.29, collector.service.v2 is the default and the alternatives no longer exist.

windows-exporter time="2025-01-28T16:10:04Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=37

These messages are unknown to me. They seem to come directly from https://github.com/microsoft/hcsshim/blob/8d81359dc374e39d9edd63639a0402fbbea694f9/internal/hcs/waithelper.go#L37, since the log format is different: time instead of ts, and caller is missing.

This could be a follow-up error: if Prometheus cancels the request, other contexts within the exporter are canceled as well, which may lead to that situation.
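To see how the two log formats are mixed in a larger log file, the lines can be grouped by their timestamp key and message. This is a minimal sketch that only relies on the key=value layout visible in the excerpts above; the function names and the "ts-style" / "time-style" labels are made up for illustration.

```python
import re
from collections import Counter

# key=value pairs where the value is either a double-quoted string or a bare token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line):
    """Return the key=value fields of one log line as a dict."""
    fields = {}
    for key, value in _PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        # A repeated key (caller appears twice in the ts-style lines) keeps the last value.
        fields[key] = value
    return fields


def log_style(fields):
    """Classify a parsed line by which timestamp key it carries."""
    if "ts" in fields:
        return "ts-style"
    if "time" in fields:
        return "time-style"
    return "unknown"


def summarize(lines):
    """Count lines per (style, message); lines without a msg field are skipped."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if "msg" in fields:
            counts[(log_style(fields), fields["msg"])] += 1
    return counts
```

Feeding the pod log into `summarize` and printing `counts.most_common()` shows at a glance which messages come from the exporter's own logger (ts-style) and which come from elsewhere (time-style), and whether they appear around the same time.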

Action items:

  • Reduce the number of collectors to a minimum.
  • Use the collector.service.v2 flag if the service collector can't be disabled.
  • Increase the scrape timeout and interval.
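To decide which collectors to drop, it helps to know which ones are slow. The sketch below ranks collectors by their last reported duration. It assumes the exporter exposes per-collector timings as windows_exporter_collector_duration_seconds with a collector label; please verify that name against your own /metrics output. The function name is made up for illustration.

```python
import re

# Assumed metric name; check that it appears in your exporter's /metrics output.
_DURATION = re.compile(
    r'^windows_exporter_collector_duration_seconds'
    r'\{[^}]*collector="([^"]+)"[^}]*\}\s+(\S+)'
)


def slowest_collectors(metrics_text, top=5):
    """Return up to `top` (collector, seconds) pairs, slowest first."""
    durations = {}
    for line in metrics_text.splitlines():
        match = _DURATION.match(line.strip())
        if match:
            durations[match.group(1)] = float(match.group(2))
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Pass the body of a /metrics response to `slowest_collectors`. Collectors near the top of the list are the first candidates to disable or reconfigure.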

There are tons of performance fixes in 0.29 and 0.30 which may resolve the issues as well.

@paoloyx
Author

paoloyx commented Jan 29, 2025

Thanks a lot @jkroepke for taking the time to give this thorough response, I really appreciate it.
So, regarding the exporter's own metrics, that's ok: HCS cannot provide them, but we can mitigate via the process collector, as you suggested. Maybe we don't even need them, but it's reassuring to know that we can get them if we want to.

As for the errors in the logs, I've updated the chart to the latest version, 0.8.0 (so the exporter now runs version 0.29.2), and they are indeed all gone except the ones related to the hcsshim library: https://github.com/microsoft/hcsshim/blob/8d81359dc374e39d9edd63639a0402fbbea694f9/internal/hcs/waithelper.go#L37

[...]
windows-exporter time="2025-01-29T14:07:57Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1095
windows-exporter time="2025-01-29T14:07:57Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1096
windows-exporter time="2025-01-29T14:07:57Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1097
windows-exporter time="2025-01-29T14:07:57Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1098
windows-exporter time="2025-01-29T14:07:57Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1099
windows-exporter time="2025-01-29T14:08:12Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1100
windows-exporter time="2025-01-29T14:08:12Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1102
windows-exporter time="2025-01-29T14:08:12Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1103
windows-exporter time="2025-01-29T14:08:12Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1104
windows-exporter time="2025-01-29T14:08:12Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1105
windows-exporter time="2025-01-29T14:08:12Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1106
windows-exporter time="2025-01-29T14:08:12Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1109
windows-exporter time="2025-01-29T14:08:12Z" level=error msg="failed to waitForNotification: callbackNumber does not exist in callbackMap" callbackNumber=1108
[...]

The callbackNumber keeps increasing from one log line to the next. Honestly, understanding what's happening is beyond my current knowledge, but the exporter is working great and is giving us all the metrics we need.
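In case it is useful as debugging info, here is a small sketch for checking how the callbackNumber values actually progress in a log file. It only relies on the callbackNumber=N field visible in the lines above; the function names are made up for illustration.

```python
import re

# Matches the structured field only, not the word "callbackNumber" inside msg.
_CALLBACK = re.compile(r"\bcallbackNumber=(\d+)\b")


def callback_numbers(lines):
    """Extract the callbackNumber field from each line that has one, in log order."""
    numbers = []
    for line in lines:
        match = _CALLBACK.search(line)
        if match:
            numbers.append(int(match.group(1)))
    return numbers


def out_of_order(numbers):
    """Return adjacent (previous, current) pairs where the value did not increase."""
    return [(a, b) for a, b in zip(numbers, numbers[1:]) if b <= a]


def missing(numbers):
    """Return the values between min and max that never appear in the log."""
    if not numbers:
        return []
    seen = set(numbers)
    return [n for n in range(min(numbers), max(numbers) + 1) if n not in seen]
```

`out_of_order` lists places where a line was logged with a lower number than the one before it, and `missing` lists numbers that produced no error line at all, which might help narrow down which callbacks are affected.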

So I think we can close this one, and I can try to help with some debugging info if that's useful for the project.
Thanks again for your support :)

@jkroepke
Member

Technically it's a bug, and all bugs should be addressed.

I might have to raise an upstream issue here. I need the version of the container as well.

In your case, I need the AMI of your Windows nodes. All other information can be found here:

https://docs.aws.amazon.com/eks/latest/userguide/eks-ami-versions-windows.html

@jkroepke
Member

is giving us all the metrics we need.

Do you have metrics like windows_container_network_receive_packets_total?

@paoloyx
Author

paoloyx commented Jan 30, 2025

Hi @jkroepke

Do you have metrics like windows_container_network_receive_packets_total?

Yes, I can confirm that
