Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics for tracking live servers #113

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

iainlane
Copy link
Contributor

Problem

Occasionally we've noticed that the autoscalers know about instances that the VM providers (e.g. GCP or AWS) don't or vice versa, probably due to latent bugs in the autoscaler. I'd like to expose the servers from the autoscaler and then we can correlate both sides to have an alert when they get out of sync. Then we should be able to look in logs and hopefully identify something that can help get the bug(s) fixed. Currently, we might not notice until a bit later and then it's hard to know where to drill down to.

Proposal

Add a new metric drone_server_known_instance with a label for the instance name.

This should allow us to correlate the servers that drone thinks it knows about with those that GCP has.

Technically this has unbounded cardinality but I think it's okay in reality since the metric is deleted when a server goes away and we'll really only be doing instant ("give me the values now") queries on this.

Would be interested in people's thoughts on this.

Alternatives

drone_server_count

There is already drone_server_count which could gain some labels.

  1. I'm not sure if that counts as an "API break", could it make existing queries stop working? Actually don't know the answer to that one.
  2. Would cardinality be more of a problem there, is that a relevant concern?

Doing something with logs

For completeness-

We could make sure to log nice clear messages when servers turn up and leave.

  1. It'll be annoying to correlate if done manually
  2. Unless we use a recording rule somehow
  3. Which then has the same problems as the others, so why not use a metric directly?

@iainlane
Copy link
Contributor Author

I'll fix up the test failures in a few days when I get a chance, still would appreciate feedback on the approach 🙂

@iainlane
Copy link
Contributor Author

tests are fixed!

@iainlane
Copy link
Contributor Author

@eoinmccafee00 I think that is about a different issue, can you re-check please?

@eoinmcafee00
Copy link
Contributor

@eoinmccafee00 I think that is about a different issue, can you re-check please?

Apologizes yeah I closed the wrong ticket.

@eoinmcafee00
Copy link
Contributor

Hey @iainlane

Can you wrap a feature flag around this, please? I'd rather not have it enabled by default for now.

Cheers,
Eoin

This should allow us to correlate the servers that drone thinks it knows
about with those that GCP has
Now we pass a collector to `ServerDelete()`, more metrics are in the
registry and the ones we want are at the end.
Only expose the new metrics when this variable is set
@iainlane iainlane force-pushed the iainlane/add-server-tracking-metrics branch from 7888e52 to 8ee3814 Compare July 21, 2022 13:22
@iainlane
Copy link
Contributor Author

@eoinmcafee00 okay, re-pushed

I checked this works as expected, with the flag set we get metrics like

# HELP drone_server_known_instance Known server instances.
# TYPE drone_server_known_instance gauge
drone_server_known_instance{name="drone-linux-amd64-SeNtfUdr",provider="google",region="us-central1-a",size="e2-standard-2"} 1

I should be able to join that with the lists from the CSPs & we'll be able to see if one side knows about an instance the other doesn't.


func init() {
registerKnownServers, _ = strconv.ParseBool(
os.Getenv("DRONE_AUTOSCALER_REGISTER_KNOWN_SERVERS"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this to https://github.com/drone/autoscaler/blob/master/config/config.go & inject the config struct, similar to how we do it elsewhere please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, pushed, can you re-review please?

Rename to DRONE_METRICS_REGISTER_KNOWN_SERVERS, expose via config
struct, reorder imports
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants