fix: pdb validity check & deletion (#599)
# Description

## Background

When creating a Pod Disruption Budget (PDB), Kubernetes gives users two ways to specify the allowed amount of disruption: `minAvailable` or `maxUnavailable`. For example:

```
current replica = 2

Use MinAvailable:
  min available = 80%
  allowed disruption = current replica - min available replica
                     = 2 - ceil(80% * 2)
                     = 2 - 2
                     = 0

Use MaxUnavailable:
  max unavailable = 20%
  allowed disruption = ceil(20% * 2)
                     = ceil(0.4)
                     = 1
```

## Issue

### 1. Miscalculation on PDB validity check

Merlin runs a validity check on the PDB configuration. If the configuration doesn't allow any disruption, Merlin does not create the PDB. This avoids a situation where Kubernetes can't remove any replica because all replicas must stay up.

When the Merlin configuration uses `maxUnavailability`, the calculation is ([ref1](https://github.com/caraml-dev/merlin/blob/8c4930dc2c3f3edf3f1a862a99dd3cb5730c50cb/api/cluster/pdb.go#L88) & [ref2](https://github.com/caraml-dev/merlin/blob/8c4930dc2c3f3edf3f1a862a99dd3cb5730c50cb/api/cluster/pdb.go#L94-L96)):

```
minPercentage = (100 - maxUnavailability)%
if not (minPercentage * minReplica < minReplica)
    don't create PDB
```

This doesn't reflect the actual calculation in Kubernetes. For example:

```
replica: 5
maxUnavailability: 10%

Kubernetes calculation:
  allowed disruption = ceil(10% * 5) = ceil(0.5) = 1

Merlin calculation:
  minPercentage = (100 - 10)% = 90%
  minAvailableReplica = ceil(90% * 5) = ceil(4.5) = 5
  allowed disruption = 5 - 5 = 0
```

In the example above, Merlin therefore does not create a PDB, because it concludes that the PDB would not allow any pod to be disrupted. Merlin avoids such PDBs because, in the event of a node scale-down, the pods could not be removed, causing a deadlock. In fact, the Kubernetes calculation does allow 1 pod to be disrupted.

### 2. Unused PDB not deleted

When a model is redeployed, a new revision ID is created, and there's a chance that an unused PDB from the previous revision is not deleted.

Scenario:

1. Current state of the model: it has predictor and transformer PDBs.
2. The user redeploys a new model version, and this version doesn't have a PDB.
3. The newer model version is deployed, but the old unused PDB is not deleted (because Merlin doesn't check for or delete a previously existing PDB).

**Notes:** the old PDB will not affect the new model deployment, because `selector.matchLabels` contains the label `inferenceservice:{modelName}-{versionID}-{revisionID}`, and the revisionID is incremented on every new deployment. So it is safe not to delete the PDB, but leaving it in place will confuse users.

# Modifications

Changes:

- Fix the validity check of the PDB. `minAvailable` keeps the current behavior, but if the Merlin config uses `maxUnavailable`, the PDB is created as long as the number is greater than 0.
- Merlin creates PDBs with a fixed name of `{name}-{versionID}-{componentType}-pdb`, where `componentType` is either `predictor` or `transformer`. The code now compares those two fixed names with the names of the new PDBs and deletes any name that doesn't exist among the new PDB names.

# Tests

<!-- Besides the existing / updated automated tests, what specific scenarios should be tested? Consider the backward compatibility of the changes, whether corner cases are covered, etc. Please describe the tests and check the ones that have been completed. Eg:
- [x] Deploying new and existing standard models
- [ ] Deploying PyFunc models
-->

# Checklist

- [x] Added PR label
- [x] Added unit test, integration, and/or e2e tests
- [x] Tested locally
- [ ] Updated documentation
- [ ] Update Swagger spec if the PR introduces API changes
- [ ] Regenerated Golang and Python client if the PR introduces API changes

# Release Notes

<!-- Does this PR introduce a user-facing change? If no, just write "NONE" in the release-note block below. If yes, a release note is required. Enter your extended release note in the block below. If the PR requires additional action from users switching to the new release, include the string "action required". For more information about release notes, see kubernetes' guide here: http://git.k8s.io/community/contributors/guide/release-notes.md -->

```release-note
Fix wrong validation check of PDB and delete unused PDB from previous version model
```
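For reviewers who want to check the arithmetic in Issue 1, here is a minimal Python sketch of the two calculations and of the revised check. It is an illustration only, not Merlin's Go code: all function names are hypothetical, percentages are passed as whole numbers, and the `minAvailable` branch assumes that "current behavior" means requiring at least one allowed disruption.

```python
def ceil_pct(replicas: int, pct: int) -> int:
    """Ceiling of replicas * pct / 100, using integer math to avoid float rounding."""
    return (replicas * pct + 99) // 100


def allowed_with_min_available(replicas: int, min_available_pct: int) -> int:
    """Allowed disruptions when the PDB is expressed with minAvailable."""
    return max(replicas - ceil_pct(replicas, min_available_pct), 0)


def allowed_with_max_unavailable(replicas: int, max_unavailable_pct: int) -> int:
    """Allowed disruptions when the PDB is expressed with maxUnavailable."""
    return ceil_pct(replicas, max_unavailable_pct)


def legacy_allowed_from_max_unavailable(replicas: int, max_unavailable_pct: int) -> int:
    """The old check described in Issue 1: convert maxUnavailable to a
    minAvailable percentage first, which rounds in the wrong direction."""
    return allowed_with_min_available(replicas, 100 - max_unavailable_pct)


def should_create_pdb(replicas: int, min_available_pct=None, max_unavailable_pct=None) -> bool:
    """Revised rule from Modifications (hypothetical signature)."""
    if max_unavailable_pct is not None:
        # Any positive maxUnavailable allows at least one disruption after rounding up.
        return max_unavailable_pct > 0
    if min_available_pct is not None:
        # Assumption: the unchanged minAvailable path requires a non-zero budget.
        return allowed_with_min_available(replicas, min_available_pct) > 0
    return False


if __name__ == "__main__":
    # Example from Issue 1: 5 replicas, maxUnavailability of 10%.
    print(allowed_with_max_unavailable(5, 10))         # Kubernetes-style result
    print(legacy_allowed_from_max_unavailable(5, 10))  # old Merlin-style result
```

With 5 replicas and 10%, the direct calculation allows one disruption while the converted one allows none, which is the mismatch this PR fixes.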
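The cleanup of unused PDBs described under Modifications comes down to a name comparison. The sketch below shows that comparison in Python under stated assumptions: the helper names and the example model name are hypothetical, and the real change is implemented in Merlin's Go code against the Kubernetes API rather than plain lists.

```python
COMPONENT_TYPES = ("predictor", "transformer")


def pdb_name(name: str, version_id: int, component_type: str) -> str:
    """Build the fixed PDB name: {name}-{versionID}-{componentType}-pdb."""
    return f"{name}-{version_id}-{component_type}-pdb"


def unused_pdb_names(name: str, version_id: int, new_pdb_names) -> list:
    """Return the fixed PDB names that are absent from the new deployment's PDBs.

    These are the candidates for deletion after a redeploy.
    """
    wanted = set(new_pdb_names)
    candidates = [pdb_name(name, version_id, c) for c in COMPONENT_TYPES]
    return [n for n in candidates if n not in wanted]


if __name__ == "__main__":
    # The new deployment keeps only the predictor PDB, so the transformer PDB is unused.
    new_names = [pdb_name("my-model", 3, "predictor")]
    for stale in unused_pdb_names("my-model", 3, new_names):
        print("would delete:", stale)
```

Because the names are fixed per model version and component type, the comparison doesn't need the revision ID.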