
Snapshot metrics #121

Open
JohnStrunk opened this issue Apr 6, 2021 · 16 comments
Labels
enhancement New feature or request

Comments

@JohnStrunk
Member

Describe the feature you'd like to have.
Currently, snapscheduler doesn't provide any metrics related to the snapshots attempted/created. It would be good to provide some stats that could be monitored and alerted on.

What is the value to the end user? (why is it a priority?)
Users that depend on snapshots to protect their data should have a way to monitor whether those snapshots are being created successfully.

How will we know we have a good solution? (acceptance criteria)

Additional context
cc: @prasanjit-enginprogam

@JohnStrunk JohnStrunk added the enhancement New feature or request label Apr 6, 2021
@prasanjit-enginprogam

prasanjit-enginprogam commented Apr 7, 2021

@JohnStrunk: Here are the additional stats that we are requesting:

  1. readyToUse boolean flag, based on
    - SnapshotSchedule Name
    - Match Labels
    - Namespaces
    - VolumeSnapshotClass
  2. Current count of snapshots per PVC, so the team can be alerted when it reaches the maxCount value in SnapshotSchedule.yaml (kind: SnapshotSchedule), based on
    - SnapshotSchedule Name
    - Match Labels
    - Namespaces
    - VolumeSnapshotClass
  3. Current count of count/volumesnapshots.snapshot.storage.k8s.io, based on namespace.

Our Helm-chart-based YAML files are:

snapschedule.yaml

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: consul-snapshot
  namespace: {{ .Values.namespace }}
spec:
  disabled: {{ .Values.snapshotDisabledFlag }}
  claimSelector:
    matchLabels:
      {{- range $key, $value := .Values.selector }}
        {{ $key }}: {{ $value | quote }}
      {{- end }}
  retention:
    expires: {{ .Values.snapshotExpiry }}
    maxCount: {{ .Values.maxCount }}
  schedule: {{ .Values.schedule }}
  snapshotTemplate:
    labels:
      {{- range $key, $value := .Values.selector }}
        {{ $key }}: {{ $value | quote }}
      {{- end }}
    snapshotClassName: {{ .Values.snapshotClassName }}

snapshotquota.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: volumesnapshotsquota
  namespace: {{ .Values.namespace }}
spec:
  hard:
    count/volumesnapshots.snapshot.storage.k8s.io: {{ .Values.snapshotQuota | quote }}

@prasanjit-enginprogam

@JohnStrunk: Let me know if this looks okay to you.

@JohnStrunk
Member Author

I think I'd like to limit the metrics to objects that SnapScheduler actually manages (i.e., not report on all snapshots, just those created from a schedule).
Perhaps:

  • current_snapshots_total - gauge - {labels: schedule_name, schedule_namespace, pvc_name}
    • The number of VolumeSnapshots currently associated with the schedule, namespace, and PVC
  • current_snapshots_ready_total - gauge - {labels: schedule_name, schedule_namespace, pvc_name}
    • The number of snapshots that are currently readyToUse
  • snapshots_total - counter - {labels: schedule_name, schedule_namespace, pvc_name}
    • The total number of snapshots that have been created

The trick is to get metrics that are useful, not too difficult to implement, and don't have terribly high cardinality for Prometheus.
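
For illustration, scrape output might look roughly like this (the metric names follow the proposal above; the label values are made-up placeholders):

# TYPE current_snapshots_total gauge
current_snapshots_total{schedule_name="daily-snap",schedule_namespace="my-ns",pvc_name="data-pvc-0"} 7
# TYPE current_snapshots_ready_total gauge
current_snapshots_ready_total{schedule_name="daily-snap",schedule_namespace="my-ns",pvc_name="data-pvc-0"} 6
# TYPE snapshots_total counter
snapshots_total{schedule_name="daily-snap",schedule_namespace="my-ns",pvc_name="data-pvc-0"} 42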

@prasanjit-enginprogam

"(i.e., not report on all snapshots, just those created from a schedule)." -- agreed
Can we report if the snapshot is successful? I think point 1 is really important to us.

readyToUse boolean flag based on

  • SnapshotSchedule Name
  • Match Labels
  • Namespaces
  • VolumeSnapshotClass

@JohnStrunk
Member Author

I was hoping the ready_total vs total would be sufficient for that use case.

Could you explain a bit more about the need for match labels and VSC in the metrics? I'm particularly concerned about encoding the labels. If the labels and the VSC are determined by the SnapshotSchedule object, wouldn't its name/namespace be sufficient?
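
For example, a query along these lines could flag snapshots that exist but aren't ready yet (a rough sketch using the gauge names proposed above; a freshly created snapshot is briefly not ready, so any alert on this would need a short grace period):

current_snapshots_total - current_snapshots_ready_total > 0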

@prasanjit-enginprogam

prasanjit-enginprogam commented Apr 13, 2021

@JohnStrunk: Here is our use case: we are backing up a few StatefulSet services under a specific namespace, and they are currently identified by the "app" label. The ask is to be notified if there is a backup failure so that the Ops team can take a look and fix the issue. We are using Prometheus to scrape the "metrics" endpoint ---> Alertmanager ---> PagerDuty and Slack notifications.

Currently, there is a single VSC tied to "ebs.csi.aws.com", but later we want to connect to different drivers such as EFS and create a separate VSC, so a 1-1 mapping.

$ kubectl get SnapshotSchedule -n NAMESPACE -l'app.kubernetes.io/name=ABC'
NAME           SCHEDULE    MAX AGE   MAX NUM   DISABLED   NEXT SNAPSHOT
app-snapshot   0 6 * * *   168h      15        false      2021-04-13T06:00:00Z
$

Now, this snapshot schedule covers 3 different EBS volumes for the "app" cluster.

We want to get notified if:

  1. One out of these 3 EBS volumes fails to get backed up.
  2. All EBS volumes fail to get backed up.
  3. The backup didn't run at all for some reason.

@JohnStrunk
Member Author

My thought here is that you'd monitor the "app-snapshot" schedule (by filtering on schedule_name) and expect 3 new ready snaps every day.
So, it would probably be good to add a corresponding snapshots_ready_total counter as well.
The failure of the snapshotting flow itself would have to be detected by a snapshot never becoming ready, but there's also a case to be made for adding an error counter. That could be incremented if the operator is unable to create the VolumeSnapshot object itself (e.g., quota or RBAC problems).
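
On the Prometheus side, alert rules along these lines could cover the three cases above (a minimal sketch, assuming the proposed snapshots_ready_total and snapshots_total counters; the namespace, schedule name, and the expected per-day count of 3 are placeholders for this example, and an additional absent()-style rule would be needed if the series can disappear entirely):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: snapscheduler-alerts
  namespace: NAMESPACE
spec:
  groups:
    - name: snapscheduler
      rules:
        # Fewer ready snapshots than expected over the last day: catches both
        # a single failed volume and a complete backup failure.
        - alert: SnapshotScheduleMissedReadySnapshots
          expr: |
            sum by (schedule_name, schedule_namespace)
              (increase(snapshots_ready_total{schedule_name="app-snapshot"}[24h])) < 3
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: Fewer ready snapshots than expected for the app-snapshot schedule
        # No snapshot creation attempts at all over the last day: the backup
        # didn't run for some reason.
        - alert: SnapshotScheduleNotRunning
          expr: |
            sum by (schedule_name, schedule_namespace)
              (increase(snapshots_total{schedule_name="app-snapshot"}[24h])) == 0
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: No snapshots created by the app-snapshot schedule in the last 24h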

@prasanjit-enginprogam

@JohnStrunk: Agreed. So far the plan looks good. Let me know once the implementation is done; I can test and let you know how it goes.

@prasanjit-enginprogam

@JohnStrunk: Just a gentle reminder: are there any updates? For us, having observability into backups is a high priority.
At least alerting on failures, based on some filters, would be a good enough starting point.

@JohnStrunk
Member Author

While it's on my list of items I'd like to add, I don't have a timeline for you.
I'd be happy to provide guidance if you or one of your colleagues would like to work on a PR for it.

@shomeprasanjit

Any updates yet, @JohnStrunk?

@neema80

neema80 commented Jul 5, 2023

Seems like this is abandoned.

@JohnStrunk
Member Author

Seems like this is abandoned.

As I said before... I'd be happy to provide guidance if someone wants to contribute a PR. However, there doesn't seem to be sufficient interest in this feature for anyone to make it happen.

@KyriosGN0

KyriosGN0 commented Jul 16, 2024

Hi @JohnStrunk, I would like to try to implement this. As I understand it, there are 4 required metrics:
snapshots_ready_total, current_snapshots_total, current_snapshots_ready_total, snapshots_total
Is there anything else I should consider before tackling this?

@JohnStrunk
Member Author

@KyriosGN0 That seems like a good summary. Thanks for offering to take a look!

@mnacharov

I hope a more general metrics solution (kube-state-metrics in my case) will finally add support for VolumeSnapshot and VolumeSnapshotContent metrics, and backube/snapscheduler will just continue to create VolumeSnapshots.
