Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kopf does not restart after 429 too many requests error #1108

Open
msfrucht opened this issue Feb 27, 2024 · 4 comments
Open

Kopf does not restart after 429 too many requests error #1108

msfrucht opened this issue Feb 27, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@msfrucht
Copy link

msfrucht commented Feb 27, 2024

Long story short

We have kopf 1.37.1 running a watch stream on a single custom resource cluster wide. Eventually we received a 429 too many requests error from the kube-apiserver. The kopf watch did not restart after this error. It shutdown and continued running the kopf liveness and readiness probe locations.

Based on the logs it doesn't look too many requests at all.
We had the following updates to the CR list.
Line 813: 2024-02-26T06:02:22.849697098Z
Line 1256: 2024-02-26T07:02:27.741847287Z
Line 1383: 2024-02-26T07:18:02.638854440Z
Line 1706: 2024-02-26T08:02:29.404285926Z
Line 2148: 2024-02-26T09:02:30.995018345Z
Line 2583: 2024-02-26T10:02:32.614030753Z
Line 3059: 2024-02-26T11:02:36.687348782Z
Line 3249: 2024-02-26T11:27:30.696478118Z
Line 3479: 2024-02-26T11:54:46.913868429Z
Line 3958: 2024-02-26T12:54:51.819996162Z

Then we hit the following error from kopf.

2024-02-26T13:43:38.870237767Z [2024-02-26 13:43:38,870] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for applications.v1alpha1.application.isf.ibm.com cluster-wide.
2024-02-26T13:43:38.872325055Z [2024-02-26 13:43:38,870] kopf._core.reactor.o [ERROR   ] Watcher for applications.v1alpha1.application.isf.ibm.com@none has failed: (None, None)
2024-02-26T13:43:38.872325055Z Traceback (most recent call last):
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/errors.py", line 148, in check_response
2024-02-26T13:43:38.872325055Z     response.raise_for_status()
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/aiohttp/client_reqrep.py", line 1060, in raise_for_status
2024-02-26T13:43:38.872325055Z     raise ClientResponseError(
2024-02-26T13:43:38.872325055Z aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('https://172.30.0.1:443/apis/application.isf.ibm.com/v1alpha1/applications?watch=true&resourceVersion=9409415&timeoutSeconds=600')
2024-02-26T13:43:38.872325055Z 
2024-02-26T13:43:38.872325055Z The above exception was the direct cause of the following exception:
2024-02-26T13:43:38.872325055Z 
2024-02-26T13:43:38.872325055Z Traceback (most recent call last):
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 96, in guard
2024-02-26T13:43:38.872325055Z     await coro
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
2024-02-26T13:43:38.872325055Z     async for raw_event in stream:
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
2024-02-26T13:43:38.872325055Z     async for raw_event in stream:
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
2024-02-26T13:43:38.872325055Z     async for raw_input in stream:
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
2024-02-26T13:43:38.872325055Z     async for raw_input in api.stream(
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/api.py", line 200, in stream
2024-02-26T13:43:38.872325055Z     response = await request(
2024-02-26T13:43:38.872325055Z                ^^^^^^^^^^^^^^
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
2024-02-26T13:43:38.872325055Z     return await fn(*args, **kwargs, context=context)
2024-02-26T13:43:38.872325055Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/api.py", line 85, in request
2024-02-26T13:43:38.872325055Z     await errors.check_response(response)  # but do not parse it!
2024-02-26T13:43:38.872325055Z     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/errors.py", line 150, in check_response
2024-02-26T13:43:38.872325055Z     raise cls(payload, status=response.status) from e
2024-02-26T13:43:38.872325055Z kopf._cogs.clients.errors.APIClientError: (None, None)
2024-02-26T13:44:16.563238107Z [2024-02-26 13:44:16,563] root                 [INFO    ] Probe response True, {'connected': True}
2024-02-26T13:44:16.564237153Z [2024-02-26 13:44:16,564] kopf.activities.prob [INFO    ] Activity 'check_kafka_health' succeeded.
2024-02-26T13:44:16.564269573Z [2024-02-26 13:44:16,564] kopf.activities.prob [DEBUG   ] Activity 'check_rbacs' is invoked.
2024-02-26T13:44:16.606139781Z [2024-02-26 13:44:16,606] kopf.activities.prob [INFO    ] Activity 'check_rbacs' succeeded.

All of the available watching timeout settings have been set. But no restart occurs. This fails silently without ending the kopf executable making it impossible to recover from.

settings.watching.connect_timeout = 60
settings.watching.server_timeout = 600
settings.watching.client_timeout = 660
settings.watching.reconnect_backoff = 1

Is there some setting or something we are missing to deal with this issue? Any execution recommendations that can be done to avoid this issue?

Kopf version

1.37.1

Kubernetes version

v1.27.8+4fab27b - Red Hat Openshift 4.14.8

Python version

3.11

Code

Our kopf configuration. Never been able to reproduce this issue reliably.

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    settings.posting.level = logging.DEBUG
    # timeouts are in seconds
    settings.watching.connect_timeout = 60
    settings.watching.server_timeout = 600
    settings.watching.client_timeout = 660
    settings.watching.reconnect_backoff = 1
    settings.networking.error_backoffs = InfiniteBackoff()
    settings.batching.error_delays = [10, 20, 30, 40]
    settings.peering.name = "application-peers"
    settings.peering.stealth = True
    settings.persistence.finalizer = "application-controller"
    settings.posting.enabled = False
    application_discovery_state.isDiscoveryCompleted = Fals

Logs

..setup at 2024-02-26T06:01:32.599757687Z
...probe repeats
2024-02-26T11:54:46.913868429Z [2024-02-26 11:54:46,913] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Updating is in progress: {'apiVersion': 'application.isf.ibm.com/v1alpha1', 'kind': 'Application', 'metadata': {'annotations': {'isf.ibm.com/created-event': 'created', 'kopf.zalando.org/last-handled-configuration': '{"spec":{"enableDR":false,"includedNamespaces":["clusters-hcp24feb"]},"metadata":{"labels":{"isf.ibm.com/clusterId":"b8f4b4bf-b113-417a-8708-3fc1ce10d9ce"},"annotations":{"isf.ibm.com/created-event":"created","usedcapacity":"1884758016"},"uid":"1b79fa23-5b73-4a46-806f-73f3dad59db3"}}\n', 'usedcapacity': '1884819456'}, 'creationTimestamp': '2024-02-24T04:17:17Z', 'finalizers': ['application-controller'], 'generation': 1, 'labels': {'isf.ibm.com/clusterId': 'b8f4b4bf-b113-417a-8708-3fc1ce10d9ce'}, 'managedFields': [{'apiVersion': 'application.isf.ibm.com/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'f:kopf.zalando.org/last-handled-configuration': {}}, 'f:finalizers': {'.': {}, 'v:"application-controller"': {}}}}, 'manager': 'kopf', 'operation': 'Update', 'time': '2024-02-26T11:02:36Z'}, {'apiVersion': 'application.isf.ibm.com/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:status': {'.': {}, 'f:activeRestore': {}, 'f:availableBackups': {}, 'f:backupStatus': {}, 'f:drStatus': {'.': {}, 'f:initialSyncComplete': {}}, 'f:lastSuccessfulBackupTime': {}, 'f:observedGeneration': {}, 'f:pvcStatuses': {}, 'f:storageClassCapacities': {'.': {}, 'f:lvms-hcp-etcd': {'.': {}, 'f:numPVCs': {}, 'f:totalCapacity': {}, 'f:usedCapacity': {}}, 'f:ocs-storagecluster-ceph-rbd': {'.': {}, 'f:numPVCs': {}, 'f:totalCapacity': {}, 'f:usedCapacity': {}}}, 'f:totalBackups': {}, 'f:totalCapacity': {}, 'f:totalPVCCapacity': {}, 'f:usedCapacity': {}, 'f:usedPVCCapacity': {}}}, 'manager': 'manager', 'operation': 'Update', 'subresource': 'status', 'time': '2024-02-26T11:02:38Z'}, {'apiVersion': 'application.isf.ibm.com/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'.': {}, 'f:isf.ibm.com/created-event': {}, 'f:usedcapacity': {}}, 'f:labels': {'.': {}, 'f:isf.ibm.com/clusterId': {}}}, 'f:spec': {'.': {}, 'f:enableDR': {}, 'f:includedNamespaces': {}}}, 'manager': 'manager', 'operation': 'Update', 'time': '2024-02-26T11:54:46Z'}], 'name': 'clusters-hcp24feb', 'namespace': 'ibm-spectrum-fusion-ns', 'resourceVersion': '9215010', 'uid': '1b79fa23-5b73-4a46-806f-73f3dad59db3'}, 'spec': {'enableDR': False, 'includedNamespaces': ['clusters-hcp24feb']}, 'status': {'activeRestore': {}, 'availableBackups': 0, 'backupStatus': [], 'drStatus': {'initialSyncComplete': False}, 'lastSuccessfulBackupTime': 0, 'observedGeneration': 1, 'pvcStatuses': [{'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-0', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-1', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-2', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-82k9b-rhcos', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-dpfqh-rhcos', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-lvzfh-rhcos', 'pvcStatus': 'Bound'}], 'storageClassCapacities': {'lvms-hcp-etcd': {'numPVCs': 3, 'totalCapacity': 25769803776, 'usedCapacity': 1884758016}, 'ocs-storagecluster-ceph-rbd': {'numPVCs': 3, 'totalCapacity': 103079215104, 'usedCapacity': 0}}, 'totalBackups': 0, 'totalCapacity': 128849018880, 'totalPVCCapacity': 128849018880, 'usedCapacity': 1884758016, 'usedPVCCapacity': 1884758016}}
2024-02-26T11:54:46.913938731Z [2024-02-26 11:54:46,913] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Updating diff: (('change', ('metadata', 'annotations', 'usedcapacity'), '1884758016', '1884819456'),)
2024-02-26T11:54:46.913972995Z [2024-02-26 11:54:46,913] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Handler 'on_metadata_annotations/metadata.annotations' is invoked.
2024-02-26T11:54:46.914303997Z [2024-02-26 11:54:46,914] root                 [DEBUG   ] Modified application metadata.annotations request received with name=clusters-hcp24feb, namespace=ibm-spectrum-fusion-ns, spec={'enableDR': False, 'includedNamespaces': ['clusters-hcp24feb']}, body={'apiVersion': 'application.isf.ibm.com/v1alpha1', 'kind': 'Application', 'metadata': {'annotations': {'isf.ibm.com/created-event': 'created', 'kopf.zalando.org/last-handled-configuration': '{"spec":{"enableDR":false,"includedNamespaces":["clusters-hcp24feb"]},"metadata":{"labels":{"isf.ibm.com/clusterId":"b8f4b4bf-b113-417a-8708-3fc1ce10d9ce"},"annotations":{"isf.ibm.com/created-event":"created","usedcapacity":"1884758016"},"uid":"1b79fa23-5b73-4a46-806f-73f3dad59db3"}}\n', 'usedcapacity': '1884819456'}, 'creationTimestamp': '2024-02-24T04:17:17Z', 'finalizers': ['application-controller'], 'generation': 1, 'labels': {'isf.ibm.com/clusterId': 'b8f4b4bf-b113-417a-8708-3fc1ce10d9ce'}}, 'manager': 'manager', 'operation': 'Update', 'time': '2024-02-26T11:54:46Z'}], 'name': 'clusters-hcp24feb', 'namespace': 'ibm-spectrum-fusion-ns', 'resourceVersion': '9215010', 'uid': '1b79fa23-5b73-4a46-806f-73f3dad59db3'}, 'spec': {'enableDR': False, 'includedNamespaces': ['clusters-hcp24feb']}, 'status': {'activeRestore': {}, 'availableBackups': 0, 'backupStatus': [], 'drStatus': {'initialSyncComplete': False}, 'lastSuccessfulBackupTime': 0, 'observedGeneration': 1, 'pvcStatuses': [{'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-0', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-1', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-2', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-82k9b-rhcos', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-dpfqh-rhcos', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-lvzfh-rhcos', 'pvcStatus': 'Bound'}], 'storageClassCapacities': {'lvms-hcp-etcd': {'numPVCs': 3, 'totalCapacity': 25769803776, 'usedCapacity': 1884758016}, 'ocs-storagecluster-ceph-rbd': {'numPVCs': 3, 'totalCapacity': 103079215104, 'usedCapacity': 0}}, 'totalBackups': 0, 'totalCapacity': 128849018880, 'totalPVCCapacity': 128849018880, 'usedCapacity': 1884758016, 'usedPVCCapacity': 1884758016}}
2024-02-26T11:54:46.914351195Z [2024-02-26 11:54:46,914] root                 [DEBUG   ] Received request to create application message.
2024-02-26T11:54:46.914422589Z [2024-02-26 11:54:46,914] root                 [DEBUG   ] Reading from application
2024-02-26T11:54:46.914462334Z [2024-02-26 11:54:46,914] root                 [INFO    ] Sending application discovery Kafka message for application ibm-spectrum-fusion-ns/clusters-hcp24feb
2024-02-26T11:54:46.914535311Z [2024-02-26 11:54:46,914] root                 [DEBUG   ] Kafka Message Sending: topic=application key={'uuid': 'b8f4b4bf-b113-417a-8708-3fc1ce10d9ce'} value={'action': 'MODIFY', 'clusterId': 'b8f4b4bf-b113-417a-8708-3fc1ce10d9ce', 'applicationName': 'ibm-spectrum-fusion-ns/clusters-hcp24feb', 'applicationId': '1b79fa23-5b73-4a46-806f-73f3dad59db3', 'annotations': "{'isf.ibm.com/created-event': 'created', 'usedcapacity': '1884819456'}", 'discoveryTimestamp': '2024-02-26T06:01:32Z', 'isDiscoveryCompleted': False}
2024-02-26T11:54:46.915049095Z [2024-02-26 11:54:46,914] kopf.objects         [INFO    ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Handler 'on_metadata_annotations/metadata.annotations' succeeded.
2024-02-26T11:54:46.915199538Z [2024-02-26 11:54:46,915] kopf.objects         [INFO    ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Updating is processed: 1 succeeded; 0 failed.
2024-02-26T11:54:46.915403771Z [2024-02-26 11:54:46,915] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Patching with: {'metadata': {'annotations': {'kopf.zalando.org/last-handled-configuration': '{"spec":{"enableDR":false,"includedNamespaces":["clusters-hcp24feb"]},"metadata":{"labels":{"isf.ibm.com/clusterId":"b8f4b4bf-b113-417a-8708-3fc1ce10d9ce"},"annotations":{"isf.ibm.com/created-event":"created","usedcapacity":"1884819456"},"uid":"1b79fa23-5b73-4a46-806f-73f3dad59db3"}}\n'}}}
2024-02-26T11:54:47.043473198Z [2024-02-26 11:54:47,043] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Something has changed, but we are not interested (the essence is the same).
2024-02-26T11:54:47.043518694Z [2024-02-26 11:54:47,043] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Handling cycle is finished, waiting for new changes.
2024-02-26T11:54:49.302440211Z [2024-02-26 11:54:49,302] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Something has changed, but we are not interested (the essence is the same).
2024-02-26T11:54:49.302489643Z [2024-02-26 11:54:49,302] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Handling cycle is finished, waiting for new changes.
2024-02-26T11:55:20.707305716Z [2024-02-26 11:55:20,707] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/cephdemo] Something has changed, but we are not interested (the essence is the same).
2024-02-26T11:55:20.707354387Z [2024-02-26 11:55:20,707] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/cephdemo] Handling cycle is finished, waiting for new changes.
2024-02-26T11:55:26.120333469Z [2024-02-26 11:55:26,120] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/cephdemo] Something has changed, but we are not interested (the essence is the same).
2024-02-26T11:55:26.120333469Z [2024-02-26 11:55:26,120] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/cephdemo] Handling cycle is finished, waiting for new changes.
...probe repeats
2024-02-26T12:54:51.819996162Z [2024-02-26 12:54:51,819] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Updating is in progress: {'apiVersion': 'application.isf.ibm.com/v1alpha1', 'kind': 'Application', 'metadata': {'annotations': {'isf.ibm.com/created-event': 'created', 'kopf.zalando.org/last-handled-configuration': '{"spec":{"enableDR":false,"includedNamespaces":["clusters-hcp24feb"]},"metadata":{"labels":{"isf.ibm.com/clusterId":"b8f4b4bf-b113-417a-8708-3fc1ce10d9ce"},"annotations":{"isf.ibm.com/created-event":"created","usedcapacity":"1884819456"},"uid":"1b79fa23-5b73-4a46-806f-73f3dad59db3"}}\n', 'usedcapacity': '1884651520'}, 'creationTimestamp': '2024-02-24T04:17:17Z', 'finalizers': ['application-controller'], 'generation': 1, 'labels': {'isf.ibm.com/clusterId': 'b8f4b4bf-b113-417a-8708-3fc1ce10d9ce'}, 'managedFields': [{'apiVersion': 'application.isf.ibm.com/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'f:kopf.zalando.org/last-handled-configuration': {}}, 'f:finalizers': {'.': {}, 'v:"application-controller"': {}}}}, 'manager': 'kopf', 'operation': 'Update', 'time': '2024-02-26T11:54:46Z'}, {'apiVersion': 'application.isf.ibm.com/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:status': {'.': {}, 'f:activeRestore': {}, 'f:availableBackups': {}, 'f:backupStatus': {}, 'f:drStatus': {'.': {}, 'f:initialSyncComplete': {}}, 'f:lastSuccessfulBackupTime': {}, 'f:observedGeneration': {}, 'f:pvcStatuses': {}, 'f:storageClassCapacities': {'.': {}, 'f:lvms-hcp-etcd': {'.': {}, 'f:numPVCs': {}, 'f:totalCapacity': {}, 'f:usedCapacity': {}}, 'f:ocs-storagecluster-ceph-rbd': {'.': {}, 'f:numPVCs': {}, 'f:totalCapacity': {}, 'f:usedCapacity': {}}}, 'f:totalBackups': {}, 'f:totalCapacity': {}, 'f:totalPVCCapacity': {}, 'f:usedCapacity': {}, 'f:usedPVCCapacity': {}}}, 'manager': 'manager', 'operation': 'Update', 'subresource': 'status', 'time': '2024-02-26T11:54:49Z'}, {'apiVersion': 'application.isf.ibm.com/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'.': {}, 'f:isf.ibm.com/created-event': {}, 'f:usedcapacity': {}}, 'f:labels': {'.': {}, 'f:isf.ibm.com/clusterId': {}}}, 'f:spec': {'.': {}, 'f:enableDR': {}, 'f:includedNamespaces': {}}}, 'manager': 'manager', 'operation': 'Update', 'time': '2024-02-26T12:54:51Z'}], 'name': 'clusters-hcp24feb', 'namespace': 'ibm-spectrum-fusion-ns', 'resourceVersion': '9354249', 'uid': '1b79fa23-5b73-4a46-806f-73f3dad59db3'}, 'spec': {'enableDR': False, 'includedNamespaces': ['clusters-hcp24feb']}, 'status': {'activeRestore': {}, 'availableBackups': 0, 'backupStatus': [], 'drStatus': {'initialSyncComplete': False}, 'lastSuccessfulBackupTime': 0, 'observedGeneration': 1, 'pvcStatuses': [{'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-0', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-1', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-2', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-82k9b-rhcos', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-dpfqh-rhcos', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-lvzfh-rhcos', 'pvcStatus': 'Bound'}], 'storageClassCapacities': {'lvms-hcp-etcd': {'numPVCs': 3, 'totalCapacity': 25769803776, 'usedCapacity': 1884819456}, 'ocs-storagecluster-ceph-rbd': {'numPVCs': 3, 'totalCapacity': 103079215104, 'usedCapacity': 0}}, 'totalBackups': 0, 'totalCapacity': 128849018880, 'totalPVCCapacity': 128849018880, 'usedCapacity': 1884819456, 'usedPVCCapacity': 1884819456}}
2024-02-26T12:54:51.819996162Z [2024-02-26 12:54:51,819] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Updating diff: (('change', ('metadata', 'annotations', 'usedcapacity'), '1884819456', '1884651520'),)
2024-02-26T12:54:51.820076553Z [2024-02-26 12:54:51,820] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Handler 'on_metadata_annotations/metadata.annotations' is invoked.
2024-02-26T12:54:51.820634761Z [2024-02-26 12:54:51,820] root                 [DEBUG   ] Modified application metadata.annotations request received with name=clusters-hcp24feb, namespace=ibm-spectrum-fusion-ns, spec={'enableDR': False, 'includedNamespaces': ['clusters-hcp24feb']}, body={'apiVersion': 'application.isf.ibm.com/v1alpha1', 'kind': 'Application', 'metadata': {'annotations': {'isf.ibm.com/created-event': 'created', 'kopf.zalando.org/last-handled-configuration': '{"spec":{"enableDR":false,"includedNamespaces":["clusters-hcp24feb"]},"metadata":{"labels":{"isf.ibm.com/clusterId":"b8f4b4bf-b113-417a-8708-3fc1ce10d9ce"},"annotations":{"isf.ibm.com/created-event":"created","usedcapacity":"1884819456"},"uid":"1b79fa23-5b73-4a46-806f-73f3dad59db3"}}\n', 'usedcapacity': '1884651520'}, 'creationTimestamp': '2024-02-24T04:17:17Z', 'finalizers': ['application-controller'], 'generation': 1, 'labels': {'isf.ibm.com/clusterId': 'b8f4b4bf-b113-417a-8708-3fc1ce10d9ce'}, 'managedFields': [{'apiVersion': 'application.isf.ibm.com/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'f:kopf.zalando.org/last-handled-configuration': {}}, 'f:finalizers': {'.': {}, 'v:"application-controller"': {}}}}, 'manager': 'kopf', 'operation': 'Update', 'time': '2024-02-26T11:54:46Z'}, {'apiVersion': 'application.isf.ibm.com/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:status': {'.': {}, 'f:activeRestore': {}, 'f:availableBackups': {}, 'f:backupStatus': {}, 'f:drStatus': {'.': {}, 'f:initialSyncComplete': {}}, 'f:lastSuccessfulBackupTime': {}, 'f:observedGeneration': {}, 'f:pvcStatuses': {}, 'f:storageClassCapacities': {'.': {}, 'f:lvms-hcp-etcd': {'.': {}, 'f:numPVCs': {}, 'f:totalCapacity': {}, 'f:usedCapacity': {}}, 'f:ocs-storagecluster-ceph-rbd': {'.': {}, 'f:numPVCs': {}, 'f:totalCapacity': {}, 'f:usedCapacity': {}}}, 'f:totalBackups': {}, 'f:totalCapacity': {}, 'f:totalPVCCapacity': {}, 'f:usedCapacity': {}, 'f:usedPVCCapacity': {}}}, 'manager': 'manager', 'operation': 'Update', 'subresource': 'status', 'time': '2024-02-26T11:54:49Z'}, {'apiVersion': 'application.isf.ibm.com/v1alpha1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:annotations': {'.': {}, 'f:isf.ibm.com/created-event': {}, 'f:usedcapacity': {}}, 'f:labels': {'.': {}, 'f:isf.ibm.com/clusterId': {}}}, 'f:spec': {'.': {}, 'f:enableDR': {}, 'f:includedNamespaces': {}}}, 'manager': 'manager', 'operation': 'Update', 'time': '2024-02-26T12:54:51Z'}], 'name': 'clusters-hcp24feb', 'namespace': 'ibm-spectrum-fusion-ns', 'resourceVersion': '9354249', 'uid': '1b79fa23-5b73-4a46-806f-73f3dad59db3'}, 'spec': {'enableDR': False, 'includedNamespaces': ['clusters-hcp24feb']}, 'status': {'activeRestore': {}, 'availableBackups': 0, 'backupStatus': [], 'drStatus': {'initialSyncComplete': False}, 'lastSuccessfulBackupTime': 0, 'observedGeneration': 1, 'pvcStatuses': [{'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-0', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-1', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'data-etcd-2', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-82k9b-rhcos', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-dpfqh-rhcos', 'pvcStatus': 'Bound'}, {'namespace': 'clusters-hcp24feb', 'pvcName': 'hcp24feb-6febef8a-lvzfh-rhcos', 'pvcStatus': 'Bound'}], 'storageClassCapacities': {'lvms-hcp-etcd': {'numPVCs': 3, 'totalCapacity': 25769803776, 'usedCapacity': 1884819456}, 'ocs-storagecluster-ceph-rbd': {'numPVCs': 3, 'totalCapacity': 103079215104, 'usedCapacity': 0}}, 'totalBackups': 0, 'totalCapacity': 128849018880, 'totalPVCCapacity': 128849018880, 'usedCapacity': 1884819456, 'usedPVCCapacity': 1884819456}}
2024-02-26T12:54:51.820691968Z [2024-02-26 12:54:51,820] root                 [DEBUG   ] Received request to create application message.
2024-02-26T12:54:51.820752341Z [2024-02-26 12:54:51,820] root                 [DEBUG   ] Reading from application
2024-02-26T12:54:51.820794751Z [2024-02-26 12:54:51,820] root                 [INFO    ] Sending application discovery Kafka message for application ibm-spectrum-fusion-ns/clusters-hcp24feb
2024-02-26T12:54:51.820881864Z [2024-02-26 12:54:51,820] root                 [DEBUG   ] Kafka Message Sending: topic=application key={'uuid': 'b8f4b4bf-b113-417a-8708-3fc1ce10d9ce'} value={'action': 'MODIFY', 'clusterId': 'b8f4b4bf-b113-417a-8708-3fc1ce10d9ce', 'applicationName': 'ibm-spectrum-fusion-ns/clusters-hcp24feb', 'applicationId': '1b79fa23-5b73-4a46-806f-73f3dad59db3', 'annotations': "{'isf.ibm.com/created-event': 'created', 'usedcapacity': '1884651520'}", 'discoveryTimestamp': '2024-02-26T06:01:32Z', 'isDiscoveryCompleted': False}
2024-02-26T12:54:51.821499163Z [2024-02-26 12:54:51,821] kopf.objects         [INFO    ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Handler 'on_metadata_annotations/metadata.annotations' succeeded.
2024-02-26T12:54:51.821668912Z [2024-02-26 12:54:51,821] kopf.objects         [INFO    ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Updating is processed: 1 succeeded; 0 failed.
2024-02-26T12:54:51.821886440Z [2024-02-26 12:54:51,821] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Patching with: {'metadata': {'annotations': {'kopf.zalando.org/last-handled-configuration': '{"spec":{"enableDR":false,"includedNamespaces":["clusters-hcp24feb"]},"metadata":{"labels":{"isf.ibm.com/clusterId":"b8f4b4bf-b113-417a-8708-3fc1ce10d9ce"},"annotations":{"isf.ibm.com/created-event":"created","usedcapacity":"1884651520"},"uid":"1b79fa23-5b73-4a46-806f-73f3dad59db3"}}\n'}}}
2024-02-26T12:54:51.989526589Z [2024-02-26 12:54:51,989] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Something has changed, but we are not interested (the essence is the same).
2024-02-26T12:54:51.989583606Z [2024-02-26 12:54:51,989] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Handling cycle is finished, waiting for new changes.
2024-02-26T12:54:53.325395344Z [2024-02-26 12:54:53,325] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Something has changed, but we are not interested (the essence is the same).
2024-02-26T12:54:53.325395344Z [2024-02-26 12:54:53,325] kopf.objects         [DEBUG   ] [ibm-spectrum-fusion-ns/clusters-hcp24feb] Handling cycle is finished, waiting for new changes.
... probe repeats
2024-02-26T13:42:31.312701197Z [2024-02-26 13:42:31,312] aiohttp.access       [INFO    ] 10.128.2.2 [26/Feb/2024:13:42:29 +0000] "GET /healthz HTTP/1.1" 200 245 "-" "kube-probe/1.27"
2024-02-26T13:42:31.312890452Z [2024-02-26 13:42:31,312] aiohttp.access       [INFO    ] 10.128.2.2 [26/Feb/2024:13:42:29 +0000] "GET /healthz HTTP/1.1" 200 245 "-" "kube-probe/1.27"
2024-02-26T13:42:59.685426651Z [2024-02-26 13:42:59,685] kafka.client         [WARNING ] <BrokerConnection node_id=0 host=guardian-kafka-cluster-kafka-0.guardian-kafka-cluster-kafka-brokers.ibm-backup-restore.svc:9093 <connected> [IPv4 ('10.129.2.76', 9093)]> timed out after 30000 ms. Closing connection.
2024-02-26T13:42:59.685483998Z [2024-02-26 13:42:59,685] kafka.conn           [INFO    ] <BrokerConnection node_id=0 host=guardian-kafka-cluster-kafka-0.guardian-kafka-cluster-kafka-brokers.ibm-backup-restore.svc:9093 <connected> [IPv4 ('10.129.2.76', 9093)]>: Closing connection. [Error 7] RequestTimedOutError: Request timed out after 30000 ms
2024-02-26T13:42:59.685901943Z [2024-02-26 13:42:59,685] kafka.client         [WARNING ] Node 0 connection failed -- refreshing metadata
2024-02-26T13:42:59.842812784Z [2024-02-26 13:42:59,842] kafka.conn           [INFO    ] <BrokerConnection node_id=0 host=guardian-kafka-cluster-kafka-0.guardian-kafka-cluster-kafka-brokers.ibm-backup-restore.svc:9093 <connecting> [IPv4 ('10.129.2.76', 9093)]>: connecting to guardian-kafka-cluster-kafka-0.guardian-kafka-cluster-kafka-brokers.ibm-backup-restore.svc:9093 [('10.129.2.76', 9093) IPv4]
2024-02-26T13:43:29.620885986Z [2024-02-26 13:43:29,620] kopf.activities.prob [DEBUG   ] Activity 'check_kafka_health' is invoked.
2024-02-26T13:43:29.853219419Z [2024-02-26 13:43:29,852] kafka.conn           [ERROR   ] Connection attempt to <BrokerConnection node_id=0 host=guardian-kafka-cluster-kafka-0.guardian-kafka-cluster-kafka-brokers.ibm-backup-restore.svc:9093 <connecting> [IPv4 ('10.129.2.76', 9093)]> timed out
2024-02-26T13:43:29.853219419Z [2024-02-26 13:43:29,852] kafka.conn           [INFO    ] <BrokerConnection node_id=0 host=guardian-kafka-cluster-kafka-0.guardian-kafka-cluster-kafka-brokers.ibm-backup-restore.svc:9093 <connecting> [IPv4 ('10.129.2.76', 9093)]>: Closing connection. KafkaConnectionError: timeout
2024-02-26T13:43:29.853219419Z [2024-02-26 13:43:29,852] kafka.client         [WARNING ] Node 0 connection failed -- refreshing metadata
2024-02-26T13:43:29.907597266Z [2024-02-26 13:43:29,907] kafka.conn           [INFO    ] <BrokerConnection node_id=1 host=guardian-kafka-cluster-kafka-1.guardian-kafka-cluster-kafka-brokers.ibm-backup-restore.svc:9093 <connecting> [IPv4 ('10.129.2.77', 9093)]>: connecting to guardian-kafka-cluster-kafka-1.guardian-kafka-cluster-kafka-brokers.ibm-backup-restore.svc:9093 [('10.129.2.77', 9093) IPv4]
2024-02-26T13:43:30.950918274Z [2024-02-26 13:43:30,950] kafka.conn           [INFO    ] <BrokerConnection node_id=1 host=guardian-kafka-cluster-kafka-1.guardian-kafka-cluster-kafka-brokers.ibm-backup-restore.svc:9093 <handshake> [IPv4 ('10.129.2.77', 9093)]>: Connection complete.
2024-02-26T13:43:38.870237767Z [2024-02-26 13:43:38,870] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for applications.v1alpha1.application.isf.ibm.com cluster-wide.
2024-02-26T13:43:38.872325055Z [2024-02-26 13:43:38,870] kopf._core.reactor.o [ERROR   ] Watcher for applications.v1alpha1.application.isf.ibm.com@none has failed: (None, None)
2024-02-26T13:43:38.872325055Z Traceback (most recent call last):
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/errors.py", line 148, in check_response
2024-02-26T13:43:38.872325055Z     response.raise_for_status()
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/aiohttp/client_reqrep.py", line 1060, in raise_for_status
2024-02-26T13:43:38.872325055Z     raise ClientResponseError(
2024-02-26T13:43:38.872325055Z aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('https://172.30.0.1:443/apis/application.isf.ibm.com/v1alpha1/applications?watch=true&resourceVersion=9409415&timeoutSeconds=600')
2024-02-26T13:43:38.872325055Z 
2024-02-26T13:43:38.872325055Z The above exception was the direct cause of the following exception:
2024-02-26T13:43:38.872325055Z 
2024-02-26T13:43:38.872325055Z Traceback (most recent call last):
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 96, in guard
2024-02-26T13:43:38.872325055Z     await coro
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
2024-02-26T13:43:38.872325055Z     async for raw_event in stream:
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
2024-02-26T13:43:38.872325055Z     async for raw_event in stream:
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
2024-02-26T13:43:38.872325055Z     async for raw_input in stream:
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
2024-02-26T13:43:38.872325055Z     async for raw_input in api.stream(
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/api.py", line 200, in stream
2024-02-26T13:43:38.872325055Z     response = await request(
2024-02-26T13:43:38.872325055Z                ^^^^^^^^^^^^^^
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
2024-02-26T13:43:38.872325055Z     return await fn(*args, **kwargs, context=context)
2024-02-26T13:43:38.872325055Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/api.py", line 85, in request
2024-02-26T13:43:38.872325055Z     await errors.check_response(response)  # but do not parse it!
2024-02-26T13:43:38.872325055Z     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-26T13:43:38.872325055Z   File "/usr/src/app/code/.venv/lib64/python3.11/site-packages/kopf/_cogs/clients/errors.py", line 150, in check_response
2024-02-26T13:43:38.872325055Z     raise cls(payload, status=response.status) from e
2024-02-26T13:43:38.872325055Z kopf._cogs.clients.errors.APIClientError: (None, None)
2024-02-26T13:44:16.563238107Z [2024-02-26 13:44:16,563] root                 [INFO    ] Probe response True, {'connected': True}
2024-02-26T13:44:16.564237153Z [2024-02-26 13:44:16,564] kopf.activities.prob [INFO    ] Activity 'check_kafka_health' succeeded.
2024-02-26T13:44:16.564269573Z [2024-02-26 13:44:16,564] kopf.activities.prob [DEBUG   ] Activity 'check_rbacs' is invoked.
...probe repeats

Additional information

No response

@msfrucht msfrucht added the bug Something isn't working label Feb 27, 2024
@teriyakichild
Copy link

I'm seeing the same issue on 1.36.2.

@dheeg
Copy link

dheeg commented Aug 1, 2024

I see the same problem. Once an Watcher for xyz has failed error occured the game is basically over. There can be other errors leading to the same situation #1077

It would be great if one of the following could be achieved:

  • Retry and recover automatically, using the configured backoff
  • Liveness probe takes the watchers into consideration
  • We could somehow monitor the internal watchers ourself using a custom logic

@cpnielsen
Copy link

@nolar Any input / update on this? If you know where the potential issue may be, can you give some pointers for me or someone else to potentially fix?

@valen7in
Copy link

valen7in commented Nov 12, 2024

Same issues starting kopf with namespace wildcard scope:

kopf run --namespace=dev-*

Having kopf handlers on N resources (mostly CRD) and M namespaces, it seems like kopf tries to immediately start N x M listeners resulting in 429 from K8s API for most of them. kopf is running, but handlers for most resources are not fired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants