feat(api,ui,sdk): Make CPU limits configurable (#586)
# Description
Currently, users are not able to configure the CPU limits of the pods in which Merlin models and transformers are deployed; these limits are instead determined automatically at the platform level (by the Merlin API server). Depending on how the API server has been configured, one of the following happens:
- the CPU limit of a model is set to its CPU request multiplied by a [scaling factor](https://github.com/caraml-dev/merlin/blob/f1ebe099ea168988b365ee72ce08543b127826e1/api/config/config.go#L364) (e.g. 2 CPU * 1.5; see the sketch after this list), **or**
  - Note that this is the existing way memory limits are automatically set by the Merlin API server
- the CPU limit is left unset
  - Note that because KServe does not currently allow CPU limits to be left completely unset, the Merlin API server instead sets an [arbitrary value](https://github.com/caraml-dev/merlin/blob/f1ebe099ea168988b365ee72ce08543b127826e1/api/config/config.go#L363) (ideally a very large one) as the CPU limit
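For illustration, here is a minimal, standalone sketch of the factor-based derivation described in the first bullet above. It is not the API server's actual `ScaleQuantity` implementation, and the factor value of 1.5 is just the example from that bullet:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// scaledLimit derives a CPU limit from a CPU request and a scaling factor,
// working in millicores to avoid floating-point drift on fractional CPUs.
func scaledLimit(request resource.Quantity, factor float64) resource.Quantity {
	scaled := int64(float64(request.MilliValue()) * factor)
	return *resource.NewMilliQuantity(scaled, resource.DecimalSI)
}

func main() {
	req := resource.MustParse("2") // CPU request of 2 cores
	limit := scaledLimit(req, 1.5) // scaling factor, as in the example above
	fmt.Println(limit.String())    // "3", i.e. a 3-core CPU limit
}
```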

This PR introduces a new workflow that allows users to override the platform-level CPU limits (described above) set on a model. This workflow is available via the UI, via the SDK and, by extension, by calling the API server's endpoint directly.

UI:
![Screenshot 2024-05-24 at 2 13 46 PM](https://github.com/caraml-dev/merlin/assets/36802364/a2b59c1e-df2d-4070-92ff-b4f375256da1)

![Screenshot 2024-05-24 at 2 23 42 PM](https://github.com/caraml-dev/merlin/assets/36802364/616f0d03-0d36-4b82-8dd5-051d098a78c2)

SDK:
```python
merlin.deploy(
    version_1,
    resource_request=merlin.ResourceRequest(
        min_replica=0,
        max_replica=0,
        cpu_request="0.5",
        cpu_limit="2",
        memory_request="1Gi",
    ),
)
```
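
When `cpu_limit` is omitted from the resource request, the platform-level behaviour described above continues to apply.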

In addition, this PR adds a new configuration field, `DefaultEnvVarsWithoutCPULimits`, a list of env vars that are automatically added to all Merlin models and transformers when CPU limits are not set. This allows operators of the Merlin API server to set platform-wide env vars that can potentially improve the performance of such deployments, e.g. env vars controlling concurrency.
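
As a rough illustration of the new field (the field name and `[]corev1.EnvVar` type come from this PR's diff; the surrounding struct shape, package name, and YAML key are assumptions, not the exact contents of `api/config/config.go`):

```go
package config

import corev1 "k8s.io/api/core/v1"

// DeploymentConfig is sketched here only to situate the new field; the real
// struct in api/config/config.go carries many more fields.
type DeploymentConfig struct {
	// DefaultEnvVarsWithoutCPULimits is merged into a model's or transformer's
	// env vars only when the container has no explicit CPU limit, e.g. to set
	// concurrency-related variables platform-wide.
	DefaultEnvVarsWithoutCPULimits []corev1.EnvVar `yaml:"defaultEnvVarsWithoutCPULimits"`
}
```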

# Modifications
- `api/cluster/resource/templater.go` - Refactoring of the templater methods to set default env vars when CPU limits are not explicitly set and the CPU limit scaling factor is set to 0
- `api/config/config.go` - Addition of the new field `DefaultEnvVarsWithoutCPULimits`
- `api/config/config_test.go` - Addition of a new unit test covering the parsing of configs from .yaml files
- `docs/user/templates/model_deployment/01_deploying_a_model_version.md` - Addition of docs to demonstrate how the platform-level CPU limits can be overridden
- `python/sdk/merlin/resource_request.py` - Addition of a new CPU limit field to the resource request class
- `ui/src/pages/version/components/forms/components/CPULimitsFormGroup.js` - Addition of a new form group to allow CPU limits to be specified in the UI

# Tests
- [x] Deploying existing models (and transformers) with and without CPU
limits set

# Checklist
- [x] Added PR label
- [x] Added unit test, integration, and/or e2e tests
- [x] Tested locally
- [x] Updated documentation
- [x] Updated Swagger spec if the PR introduces API changes
- [x] Regenerated Golang and Python clients if the PR introduces API changes

# Release Notes
<!--
Does this PR introduce a user-facing change?
If no, just write "NONE" in the release-note block below.
If yes, a release note is required. Enter your extended release note in
the block below.
If the PR requires additional action from users switching to the new
release, include the string "action required".

For more information about release notes, see kubernetes' guide here:
http://git.k8s.io/community/contributors/guide/release-notes.md
-->

```release-note
NONE
```
deadlycoconuts authored Jun 3, 2024
1 parent d948254 commit c05af1d
Showing 39 changed files with 1,310 additions and 184 deletions.
5 changes: 5 additions & 0 deletions .github/workflows/codecov-config/codecov.yml
@@ -0,0 +1,5 @@
coverage:
  status:
    patch:
      default:
        threshold: 0.03%
2 changes: 2 additions & 0 deletions .github/workflows/merlin.yml
@@ -141,6 +141,7 @@ jobs:
name: sdk-test-${{ matrix.python-version }}
token: ${{ secrets.CODECOV_TOKEN }}
working-directory: ./python/sdk
codecov_yml_path: ../../.github/workflows/codecov-config/codecov.yml

lint-api:
runs-on: ubuntu-latest
@@ -191,6 +192,7 @@ jobs:
name: api-test
token: ${{ secrets.CODECOV_TOKEN }}
working-directory: ./api
codecov_yml_path: ../.github/workflows/codecov-config/codecov.yml

test-observation-publisher:
runs-on: ubuntu-latest
36 changes: 36 additions & 0 deletions api/client/model_resource_request.go

Some generated files are not rendered by default.

98 changes: 58 additions & 40 deletions api/cluster/resource/templater.go
@@ -192,22 +192,11 @@ func (t *InferenceServiceTemplater) CreateInferenceServiceSpec(modelService *mod
}

func (t *InferenceServiceTemplater) createPredictorSpec(modelService *models.Service) (kservev1beta1.PredictorSpec, error) {
envVars := modelService.EnvVars

// Set resource limits to request * userContainerCPULimitRequestFactor or UserContainerMemoryLimitRequestFactor
limits := map[corev1.ResourceName]resource.Quantity{}
if t.deploymentConfig.UserContainerCPULimitRequestFactor != 0 {
limits[corev1.ResourceCPU] = ScaleQuantity(
modelService.ResourceRequest.CPURequest, t.deploymentConfig.UserContainerCPULimitRequestFactor,
)
} else {
// TODO: Remove this else-block when KServe finally allows default CPU limits to be removed
var err error
limits[corev1.ResourceCPU], err = resource.ParseQuantity(t.deploymentConfig.UserContainerCPUDefaultLimit)
if err != nil {
return kservev1beta1.PredictorSpec{}, err
}
limits, envVars, err := t.getResourceLimitsAndEnvVars(modelService.ResourceRequest, modelService.EnvVars)
if err != nil {
return kservev1beta1.PredictorSpec{}, err
}

if t.deploymentConfig.UserContainerMemoryLimitRequestFactor != 0 {
limits[corev1.ResourceMemory] = ScaleQuantity(
modelService.ResourceRequest.MemoryRequest, t.deploymentConfig.UserContainerMemoryLimitRequestFactor,
@@ -329,7 +318,7 @@ func (t *InferenceServiceTemplater) createPredictorSpec(modelService *models.Ser
// 1. PyFunc default env
// 2. User environment variable
// 3. Default env variable that can be override by user environment
higherPriorityEnvVars := models.MergeEnvVars(modelService.EnvVars, pyfuncDefaultEnv)
higherPriorityEnvVars := models.MergeEnvVars(envVars, pyfuncDefaultEnv)
lowerPriorityEnvVars := models.EnvVars{}
if modelService.Protocol == protocol.UpiV1 {
lowerPriorityEnvVars = append(lowerPriorityEnvVars, models.EnvVar{Name: envGRPCOptions, Value: t.deploymentConfig.PyfuncGRPCOptions})
@@ -364,7 +353,7 @@ func (t *InferenceServiceTemplater) createPredictorSpec(modelService *models.Ser
}

case models.ModelTypeCustom:
predictorSpec = createCustomPredictorSpec(modelService, resources, nodeSelector, tolerations)
predictorSpec = createCustomPredictorSpec(modelService, envVars, resources, nodeSelector, tolerations)
}

if len(nodeSelector) > 0 {
@@ -392,28 +381,11 @@ func (t *InferenceServiceTemplater) createTransformerSpec(
modelService *models.Service,
transformer *models.Transformer,
) (*kservev1beta1.TransformerSpec, error) {
// Set resource limits to request * userContainerCPULimitRequestFactor or UserContainerMemoryLimitRequestFactor
limits := map[corev1.ResourceName]resource.Quantity{}
if t.deploymentConfig.UserContainerCPULimitRequestFactor != 0 {
limits[corev1.ResourceCPU] = ScaleQuantity(
transformer.ResourceRequest.CPURequest, t.deploymentConfig.UserContainerCPULimitRequestFactor,
)
} else {
// TODO: Remove this else-block when KServe finally allows default CPU limits to be removed
var err error
limits[corev1.ResourceCPU], err = resource.ParseQuantity(t.deploymentConfig.UserContainerCPUDefaultLimit)
if err != nil {
return nil, err
}
}
if t.deploymentConfig.UserContainerMemoryLimitRequestFactor != 0 {
limits[corev1.ResourceMemory] = ScaleQuantity(
transformer.ResourceRequest.MemoryRequest, t.deploymentConfig.UserContainerMemoryLimitRequestFactor,
)
limits, envVars, err := t.getResourceLimitsAndEnvVars(transformer.ResourceRequest, transformer.EnvVars)
if err != nil {
return nil, err
}

envVars := transformer.EnvVars

// Put in defaults if not provided by users (user's input is used)
if transformer.TransformerType == models.StandardTransformerType {
transformer.Image = t.deploymentConfig.StandardTransformer.ImageName
@@ -780,9 +752,13 @@ func createDefaultPredictorEnvVars(modelService *models.Service) models.EnvVars
return defaultEnvVars
}

func createCustomPredictorSpec(modelService *models.Service, resources corev1.ResourceRequirements, nodeSelector map[string]string, tolerations []corev1.Toleration) kservev1beta1.PredictorSpec {
envVars := modelService.EnvVars

func createCustomPredictorSpec(
modelService *models.Service,
envVars models.EnvVars,
resources corev1.ResourceRequirements,
nodeSelector map[string]string,
tolerations []corev1.Toleration,
) kservev1beta1.PredictorSpec {
// Add default env var (Overwrite by user not allowed)
defaultEnvVar := createDefaultPredictorEnvVars(modelService)
envVars = models.MergeEnvVars(envVars, defaultEnvVar)
@@ -910,3 +886,45 @@ func (t *InferenceServiceTemplater) applyDefaults(service *models.Service) {
}
}
}

func (t *InferenceServiceTemplater) getResourceLimitsAndEnvVars(
resourceRequest *models.ResourceRequest,
envVars models.EnvVars,
) (map[corev1.ResourceName]resource.Quantity, models.EnvVars, error) {
// Set resource limits to request * userContainerCPULimitRequestFactor or UserContainerMemoryLimitRequestFactor
limits := map[corev1.ResourceName]resource.Quantity{}
// Set cpu resource limits automatically if they have not been set
if resourceRequest.CPULimit == nil || resourceRequest.CPULimit.IsZero() {
if t.deploymentConfig.UserContainerCPULimitRequestFactor != 0 {
limits[corev1.ResourceCPU] = ScaleQuantity(
resourceRequest.CPURequest, t.deploymentConfig.UserContainerCPULimitRequestFactor,
)
} else {
// TODO: Remove this else-block when KServe finally allows default CPU limits to be removed
var err error
limits[corev1.ResourceCPU], err = resource.ParseQuantity(t.deploymentConfig.UserContainerCPUDefaultLimit)
if err != nil {
return nil, nil, err
}
// Set additional env vars to manage concurrency so model performance improves when no CPU limits are set
envVars = models.MergeEnvVars(ParseEnvVars(t.deploymentConfig.DefaultEnvVarsWithoutCPULimits), envVars)
}
} else {
limits[corev1.ResourceCPU] = *resourceRequest.CPULimit
}

if t.deploymentConfig.UserContainerMemoryLimitRequestFactor != 0 {
limits[corev1.ResourceMemory] = ScaleQuantity(
resourceRequest.MemoryRequest, t.deploymentConfig.UserContainerMemoryLimitRequestFactor,
)
}
return limits, envVars, nil
}

func ParseEnvVars(envVars []corev1.EnvVar) models.EnvVars {
var parsedEnvVars models.EnvVars
for _, envVar := range envVars {
parsedEnvVars = append(parsedEnvVars, models.EnvVar{Name: envVar.Name, Value: envVar.Value})
}
return parsedEnvVars
}
2 changes: 1 addition & 1 deletion api/cluster/resource/templater_gpu_test.go
@@ -58,7 +58,7 @@ var (
"nvidia.com/gpu": resource.MustParse("1"),
},
Limits: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("8"),
corev1.ResourceCPU: resource.MustParse("10"),
corev1.ResourceMemory: ScaleQuantity(defaultModelResourceRequests.MemoryRequest, 2),
"nvidia.com/gpu": resource.MustParse("1"),
},