
Grafana dashboard: add an uptime panel to overview #10762

Draft
guoard wants to merge 2 commits into main
Conversation

@guoard commented Mar 18, 2024

Proposed Changes

This pull request adds an uptime panel to the RabbitMQ overview Grafana dashboard.
With this panel, users can easily track the uptime of each RabbitMQ instance.

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

@mkuratczyk mkuratczyk self-assigned this Mar 18, 2024
@mkuratczyk (Contributor) commented:

Thanks a lot for contributing. Unfortunately it doesn't work well as currently implemented. If you restart the pods, they get a new identity in this panel - rather than an updated (shorter) uptime, you will see multiple rows for each pod:
[screenshot from 2024-03-18: uptime panel showing multiple rows per pod after a restart]

To reproduce the problem, just kubectl rollout restart statefulset foo and check the dashboard afterwards.
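
For reference, a minimal reproduction sketch (assuming a RabbitmqCluster named foo managed by the cluster operator; the exact StatefulSet name may differ in your environment):

  # restart every pod so each comes back with a new IP address
  kubectl rollout restart statefulset foo
  kubectl rollout status statefulset foo
  # then reload the Grafana overview dashboard and look for
  # duplicate uptime rows per pod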

If you can fix this, I'm happy to merge.

@michaelklishin michaelklishin changed the title Add uptime panel to rabbitmq overview grafana dashboard Grafana dashboard: add an uptime panel to overview Mar 18, 2024
@guoard (Author) commented Mar 19, 2024

@mkuratczyk thank you for your time.
I pushed another commit that should fix the problem with Kubernetes StatefulSets.

@mkuratczyk (Contributor) commented:

I'm afraid it still doesn't work when node restarts happen (which is kind of the whole point). Looking at a cluster that went through multiple node restarts, I see this:

[screenshot from 2024-03-21: uptime panel with stale rows after multiple node restarts]

@guoard (Author) commented Mar 24, 2024

I conducted several tests on a 2-node k3s cluster with 5 instances of RabbitMQ, but I couldn't replicate the issue you described. However, I'm keen to assist further.

First, could you verify that the Prometheus query in use matches the following:

rabbitmq_erlang_uptime_seconds * on(instance, job) group_left(rabbitmq_cluster) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}

If the query matches, it would be very helpful if you could share additional details or steps to reproduce the issue: specific configuration, environment details, or anything else that might shed light on the problem. Thank you in advance for your help.

@guoard (Author) commented Mar 24, 2024

This is the manifest I used to run the RabbitMQ cluster:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: foo
spec:
  replicas: 5
  service:
    type: NodePort
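
To try this locally, a sketch of the deploy-and-check sequence (this assumes the RabbitMQ cluster operator is already installed; the file name rabbitmq-foo.yaml is hypothetical):

  kubectl apply -f rabbitmq-foo.yaml
  # wait for the cluster to come up, then list the pods created for it
  kubectl get rabbitmqcluster foo
  kubectl get pods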

@mkuratczyk (Contributor) commented:

I can reproduce this even with a single node: just deploy it and then delete the pod to make it restart. The pod gets a new IP address and "becomes a new instance" (you can see the difference in the labels):
[screenshot from 2024-03-25: a restarted node shown as a new instance with different labels]

@guoard (Author) commented Mar 27, 2024

Thank you for providing additional details.

I haven't faced this issue as my monitoring setup operates outside the Kubernetes cluster, with the instance label manually defined.

It appears challenging to correlate the rabbitmq_erlang_uptime_seconds metric with rabbitmq_identity_info without a unique label on the rabbitmq_identity_info metric. Without one, the mapping seems infeasible.

If you agree with my assessment, please consider closing the PR.

@mkuratczyk mkuratczyk marked this pull request as draft March 27, 2024 08:53
@mkuratczyk (Contributor) commented:

I think uptime would indeed be valuable on the dashboard, and I'm sure we can solve the query problem. I've converted this to a draft PR and will have a look at fixing it when I have more time.

@guoard (Author) commented Mar 28, 2024

What are your thoughts on adopting the following approach?

max(max_over_time(QUERY[$__interval]))

I'm unsure of the exact implementation details for the query at the moment. However, employing this method would enable us to track the maximum uptime within a specified custom interval.
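
A rough sketch of what that could look like when combined with the query above, purely as a starting point (it assumes rabbitmq_identity_info carries a rabbitmq_node label to group by, and uses PromQL subquery syntax because the inner expression is not a plain selector):

  max by (rabbitmq_node) (
    max_over_time(
      (
        rabbitmq_erlang_uptime_seconds
        * on(instance, job) group_left(rabbitmq_cluster, rabbitmq_node)
          rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}
      )[$__interval:]
    )
  )

One caveat with max_over_time: a node that restarted mid-interval will briefly show its longer pre-restart uptime rather than the new, shorter value, so this trades accuracy right after a restart for stable per-node rows.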

@michaelklishin (Member) commented:

@mkuratczyk do you have an opinion on this approach? #10762 (comment)
