
configurable filter chains #169

Open
Kuromesi wants to merge 6 commits into main from kuromesi/plugablefilter

Conversation


@Kuromesi Kuromesi commented Jan 8, 2025

Motivation

Recently, I have been running tests in my environment based on your work. I wanted to evaluate the performance of different filter logics, but it is quite time-consuming to modify the filters through source code changes, followed by rebuilding and redeploying the application. During this process, I also noticed two TODOs in the code that align well with my needs:

// TODO(https://github.com/kubernetes-sigs/llm-instance-gateway/issues/16) Make this configurable.
kvCacheThreshold = 0.8

and

// TODO: Compare this strategy with other strategies such as top K.
func leastKVCacheFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error)

As a result, I wrote some code to address these points and would like to know if it meets your approval.

Design

First, I renamed the Filter interface to FilterChain and filterFunc to filter, since I think this better describes their functionality (maybe this is not needed).

Then I added some interfaces.

Filters generation

FilterGen is responsible for generating a filter from filter options via Get; the options should be validated by calling Validate before Get.

// FilterGen generates a filter from a filter option
type FilterGen interface {
	Name() string
	Get(*FilterOption) filter
	Validate(*FilterOption) error
}

type FilterOption struct {
	KvCacheThreshold *float64 `json:"kvCacheThreshold,omitempty"`

	QueueThresholdCritical *int `json:"queueThresholdCritical,omitempty"`
	QueueingThresholdLoRA  *int `json:"queueingThresholdLoRA,omitempty"`

	// TopK defines the top k pods to be selected; if not specified, 3 will be used
	TopK *int `json:"topK,omitempty"`
	// TopKBy defines the metric to be used for top k selection,
	// `waiting_queue_size` and `kv_cache_usage_percent` are available
	TopKBy *string `json:"topKBy,omitempty"`
}
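
For illustration, one possible FilterGen implementation for the kv-cache threshold filter could look like the sketch below (the type name, filter name, and error message are assumptions for this example, not the exact code in this PR; "fmt" is assumed to be imported):

// Sketch only: a FilterGen that produces a low-kv-cache filter, using the
// previously hard-coded 0.8 threshold as the default.
type kvCacheFilterGen struct{}

func (g *kvCacheFilterGen) Name() string { return "low_kv_cache" }

func (g *kvCacheFilterGen) Validate(opt *FilterOption) error {
	if opt != nil && opt.KvCacheThreshold != nil &&
		(*opt.KvCacheThreshold < 0 || *opt.KvCacheThreshold > 1) {
		return fmt.Errorf("kvCacheThreshold must be within [0, 1]")
	}
	return nil
}

func (g *kvCacheFilterGen) Get(opt *FilterOption) filter {
	threshold := 0.8 // default taken from the existing TODO
	if opt != nil && opt.KvCacheThreshold != nil {
		threshold = *opt.KvCacheThreshold
	}
	return func(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) {
		filtered := []*backend.PodMetrics{}
		for _, p := range pods {
			// keep only pods whose kv cache usage is at or below the threshold
			if p.KVCacheUsagePercent <= threshold {
				filtered = append(filtered, p)
			}
		}
		return filtered, nil
	}
}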

And I added a topk filter, though the performance of this filter has not been tested yet.

func filterTopKGetter(opt *FilterOption) filter {
	// defaults to 3
	k := DefaultTopK
	if opt.TopK != nil {
		k = *opt.TopK
	}
	var cmp func(a, b *backend.PodMetrics) bool
	switch *opt.TopKBy {
	case TopKByKVCacheUsagePercent:
		cmp = func(a, b *backend.PodMetrics) bool {
			return a.KVCacheUsagePercent > b.KVCacheUsagePercent
		}
	case TopKByWaitingQueueSize:
		cmp = func(a, b *backend.PodMetrics) bool {
			return a.WaitingQueueSize > b.WaitingQueueSize
		}
	}
	topk := algorithms.NewHeapTopK(cmp)
	return func(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) {
		return topk.TopK(pods, k), nil
	}
}
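
The algorithms.NewHeapTopK helper is not shown in this excerpt; below is a minimal sketch of what it could look like using container/heap, under the assumption that cmp(a, b) == true means "a should be dropped before b" (this is my reading of the contract, not the actual implementation):

// Sketch of a heap-based top-k helper. The heap root is always the least
// desirable pod kept so far, so it can be evicted cheaply.
import "container/heap"

type podHeap struct {
	pods []*backend.PodMetrics
	cmp  func(a, b *backend.PodMetrics) bool
}

func (h *podHeap) Len() int           { return len(h.pods) }
func (h *podHeap) Less(i, j int) bool { return h.cmp(h.pods[i], h.pods[j]) }
func (h *podHeap) Swap(i, j int)      { h.pods[i], h.pods[j] = h.pods[j], h.pods[i] }
func (h *podHeap) Push(x any)         { h.pods = append(h.pods, x.(*backend.PodMetrics)) }
func (h *podHeap) Pop() any {
	last := h.pods[len(h.pods)-1]
	h.pods = h.pods[:len(h.pods)-1]
	return last
}

type HeapTopK struct {
	cmp func(a, b *backend.PodMetrics) bool
}

func NewHeapTopK(cmp func(a, b *backend.PodMetrics) bool) *HeapTopK {
	return &HeapTopK{cmp: cmp}
}

// TopK keeps at most k pods, replacing the current worst pod whenever a
// better candidate arrives.
func (t *HeapTopK) TopK(pods []*backend.PodMetrics, k int) []*backend.PodMetrics {
	if k <= 0 {
		return nil
	}
	h := &podHeap{cmp: t.cmp}
	for _, p := range pods {
		if h.Len() < k {
			heap.Push(h, p)
		} else if t.cmp(h.pods[0], p) { // current worst is worse than p
			h.pods[0] = p
			heap.Fix(h, 0)
		}
	}
	return h.pods
}

Under this reading, the kv-cache comparator above would keep the k pods with the lowest KV cache usage, and the queue-size comparator the k pods with the shortest waiting queues.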

Orchestrate filter chains

FilterOrchestrator is responsible for orchestrating a filter chain from a FilterOrchestration configuration, which is loaded from a configmap for now (maybe use a CRD?).

type FilterOrchestrator interface {
	Orchestrate() FilterChain
}

type FilterOrchestration struct {
	Name                   string               `json:"name"`
	NextOnSuccess          *FilterOrchestration `json:"nextOnSuccess,omitempty"`
	NextOnFailure          *FilterOrchestration `json:"nextOnFailure,omitempty"`
	NextOnSuccessOrFailure *FilterOrchestration `json:"nextOnSuccessOrFailure,omitempty"`
	FilterOption           *FilterOption        `json:"filterOption,omitempty"`
}
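
As an illustration of the schema (the filter names below are placeholders and must match whatever the registered FilterGens return from Name(); encoding/json is assumed to be imported), a chain such as "least queuing, then top-k by kv cache" could be expressed as JSON and unmarshalled into a FilterOrchestration:

// Illustrative config only; "least_queuing" and "top_k" are placeholder
// filter names, not necessarily the names registered in this PR.
const sampleFilterConfig = `{
  "name": "least_queuing",
  "nextOnSuccessOrFailure": {
    "name": "top_k",
    "filterOption": {
      "topK": 3,
      "topKBy": "kv_cache_usage_percent"
    }
  }
}`

func parseSampleFilterConfig() (*FilterOrchestration, error) {
	cfg := &FilterOrchestration{}
	if err := json.Unmarshal([]byte(sampleFilterConfig), cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}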

I also added a reconciler for configmaps that stores the configmap in the datastore. It is quite simple and needs to be improved.

func (c *FilterConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if req.NamespacedName.Name != "filter-config" || req.NamespacedName.Namespace != "default" {
		return ctrl.Result{}, nil
	}
	cm := &corev1.ConfigMap{}
	if err := c.Get(ctx, req.NamespacedName, cm); err != nil {
		klog.Errorf("unable to get ConfigMap, err: %v", err)
		return ctrl.Result{}, err
	}
	klog.Infof("updating filter config to: %++v", cm.Data)
	c.Datastore.filterConfigMap = cm.DeepCopy()
	return ctrl.Result{}, nil
}
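
For completeness, wiring such a reconciler into a controller-runtime manager would look roughly like the sketch below (this is not necessarily the exact setup code in the PR):

// Sketch: register the ConfigMap reconciler with the controller-runtime manager.
func (c *FilterConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		Complete(c)
}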

Configurable filter chains

By doing this, the scheduler can be initialized with a FilterOrchestrator, and filters can now be configured dynamically.

func WithOrchestrator(orchestrator FilterOrchestrator) SchedulerOption {
	return func(s *Scheduler) {
		s.filterOrchestrator = orchestrator
	}
}

type SchedulerOption func(*Scheduler)

type Scheduler struct {
	podMetricsProvider PodMetricsProvider
	filter             FilterChain
	filterOrchestrator FilterOrchestrator
}
When scheduling a request, the scheduler obtains the current chain from the orchestrator and filters the candidate pods with it:

pods, err := s.filterOrchestrator.Orchestrate().Filter(req, s.podMetricsProvider.AllPodMetrics())
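
A scheduler constructor using this option might look like the following sketch (the NewScheduler signature is assumed for illustration; defaultFilter is the existing static chain):

// Sketch: construct a Scheduler with an optional FilterOrchestrator.
func NewScheduler(provider PodMetricsProvider, opts ...SchedulerOption) *Scheduler {
	s := &Scheduler{
		podMetricsProvider: provider,
		filter:             defaultFilter, // fall back to the static chain
	}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// e.g. scheduler := NewScheduler(provider, WithOrchestrator(orchestrator))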

A configmap demo, which can be orchestrated into the default filter, is shown in pkg/ext-proc/scheduling/orchestrate_test.go.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Kuromesi
Once this PR has been reviewed and has the lgtm label, please assign smarterclayton for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 8, 2025
@k8s-ci-robot
Contributor

Welcome @Kuromesi!

It looks like this is your first PR to kubernetes-sigs/gateway-api-inference-extension 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gateway-api-inference-extension has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 8, 2025
@k8s-ci-robot
Contributor

Hi @Kuromesi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 8, 2025
@kfswain
Collaborator

kfswain commented Jan 8, 2025

Hey there @Kuromesi! Thanks for the contribution! This looks like a sizable PR, so it may take some time to review and digest. But I am looking at it. 👍🏻

@liu-cong
Contributor

liu-cong commented Jan 8, 2025

Thanks for the contribution and many great ideas! A few high level thoughts:

  1. I think this PR can be broken down into 2 parts: 1. the topK filter, and 2. dynamically reconfiguring the filters via a configmap-backed FilterOrchestrator. It would be easier to review if you split this into at least 2 different PRs.
  2. For the topK filter, I did try a little bit before and didn't get good benchmark results. I didn't go further though so it's probably worth digging more. We do need to run some benchmarks to validate its effectiveness before we can merge it.
  3. A configmap based dynamic config is a great tool for dev/debug. But it adds both cognitive overhead and a runtime dependency. Just out of curiosity, do you find modifying the configmap easier than modifying the go code? The go code is a bit more verbose, but with IDE support you get autocompletion and error checks. I am afraid directly modifying the configmap text is error prone. Making the configmap approach optional and not enabled by default is probably OK if people find it convenient. If we find it useful for operators to dynamically configure some knobs, then we can consider exposing those knobs to the InferencePool or a new CRD.
  4. I do agree that the filters are getting more complex. There is an issue to track whether we can simplify it: Explore if we can simplify the filter tree in pkg/ext_proc/scheduling/scheduler.go #89

@kfswain
Collaborator

kfswain commented Jan 8, 2025

I generally agree with @liu-cong here.

Especially points 3 & 4.

I totally understand using a config map for rapid dev, as you don't need to rebuild an image and can just update a config map to set the algo you want to test and iterate on. So from a development and experimentation standpoint I think that is very strong.

But from a supportability standpoint I think that could be problematic, for reasons that Cong mentioned. Debugging a live service that has its algorithm specified in code sounds brutal.

Could we perhaps make this a flag that can be passed in on startup? Perhaps it can be a --dev flag so that you can also have more verbose logging. Or it can be independent. I'm open to either.

I do want to emphasize that I think this is very powerful for development and experimentation so I think we want this, but need a way to keep it separate from the way we configure a production algo.

@Kuromesi
Author

Kuromesi commented Jan 9, 2025

@kfswain @liu-cong Thanks for your reviews! I will break down my PR. I do agree that configuring filters in a configmap may introduce overhead and instability, so I have made some optimizations to address that.

I added a storedFilter field to store the last orchestrated filter, so if the configmap has not been updated, we don't need to regenerate the filter every time.

type FilterOrchestratorImpl struct {
	datastore    *backend.K8sDatastore
	lastUpdated  string
	storedFilter *filterChainImpl
}

And whenever an error happens during filter orchestration, I just return the default filter, so it should work fine.

filter, err := o.orchestrate(f)
if err != nil {
	klog.Errorf("error orchestrating filters: %v", err)
	filter = defaultFilter
}
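
Putting the two together, the cached orchestration could look roughly like this (using the ConfigMap's ResourceVersion and a "filter" data key is my assumption for this sketch; the actual field and key may differ):

// Sketch of the caching + fallback behavior described above.
func (o *FilterOrchestratorImpl) Orchestrate() FilterChain {
	cm := o.datastore.filterConfigMap
	if cm == nil {
		return defaultFilter
	}
	// Reuse the previously orchestrated chain if the configmap is unchanged.
	if o.storedFilter != nil && o.lastUpdated == cm.ResourceVersion {
		return o.storedFilter
	}

	f := &FilterOrchestration{}
	if err := json.Unmarshal([]byte(cm.Data["filter"]), f); err != nil {
		klog.Errorf("error unmarshalling filter config: %v", err)
		return defaultFilter
	}

	filter, err := o.orchestrate(f)
	if err != nil {
		klog.Errorf("error orchestrating filters: %v", err)
		return defaultFilter
	}
	o.lastUpdated = cm.ResourceVersion
	o.storedFilter = filter
	return filter
}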

For the question "do you find modifying the configmap easier than modifying the go code?": I think for me it makes testing much easier, since I'm trying various filter strategies to find the most suitable one for my environment. Modifying Go code with an IDE is definitely easier than editing a configmap, but the rebuild and redeploy are time-consuming (and my test environment is deployed on a cloud provider, which makes this process even more difficult). I also think configuration via configmap is not the final solution (I just use it for tests); if you agree that configuring filters is necessary, maybe we should consider designing a new CRD to standardize and simplify the configuration process.

BTW, I do think we should provide a configurable way to assemble filters (like Envoy did) and also provide different filter strategies as candidates, since different environments may require different filter strategies. I'm not sure about that yet, but I'm running tests to prove it.

I agree that the solution may be problematic and needs more tests, so I will add a flag to enable this feature and do more testing in my environment!

Also, I utilized llmperf to test the performance in my environment; I can share my results with you if you need them. And I am trying to integrate this with the Istio service mesh, and so far it works pretty well. Thanks for your comments again! Please feel free to ask me if you have any questions!

@Kuromesi Kuromesi force-pushed the kuromesi/plugablefilter branch from 7de96f5 to 87aeaac Compare January 9, 2025 03:41
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 11, 2025
@Kuromesi Kuromesi force-pushed the kuromesi/plugablefilter branch from 27c8b3c to d7edec8 Compare January 14, 2025 07:36
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2025

netlify bot commented Jan 14, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: d7edec8
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/678614143c45d100085e5168
😎 Deploy Preview: https://deploy-preview-169--gateway-api-inference-extension.netlify.app

@Kuromesi
Author

@liu-cong @kfswain Hello guys, I have made some progress since our last discussion. I conducted several experiments with various filter chains on a cluster with 5 nodes; each node has 16 vCPUs, 60 GB of RAM, and an A10 GPU. 10 pods loading the LoRA Llama2 model are running in this cluster.

I conducted the following sets of experiments:

  • requests to the base llama2 model
  • requests to the llama2 model with 10 LoRAs, with 40 connections to different LoRAs
  • requests to the llama2 model with 4 LoRAs, with 40 connections to different LoRAs

The following metrics were used to evaluate performance:

  • requests per minute
  • average time to first token (TTFT)
  • P95 of TTFT
  • average inter-token latency
  • average token throughput per second
  • average end-to-end latency
  • average cache utilization
  • average request queue time

And the filter chains I have tested are:

  • baseline (without any filter chain; LEAST_REQUEST is used)
  • default filter
  • minimum queue size topk filter (always select the pod with minimum queue size)
  • minimum kv cache topk filter (always select the pod with minimum kv cache usage)
  • queue + kv cache filter (combine the least_queuing filter and least_kv_cache filter)
  • lora + minimum queue size filter (combine the lora affinity filter and minimum queue size filter)
  • lora + minimum kv cache filter (combine the lora affinity filter and minimum kv cache usage filter)

And the results can be summarized as:

  • for base model requests, the default filter does not show better performance compared with the queue + kv cache filter and the minimum kv cache filter; it seems that KV cache is the key factor affecting performance.
  • for 10 LoRA requests, the default filter shows better performance, with approximately 30% improvement compared to other filter chains.
  • for 4 LoRA requests, the queue + kv cache filter shows the best performance, and the other filters also perform better than the default filter.

In conclusion, I think configurable filter chains are needed not only in the development stage, but also in production, since different environments need different filter chains.

I can provide the complete test results if you need!

@kfswain
Collaborator

kfswain commented Jan 15, 2025

Honestly, if you don't mind, the complete test results would be great to see just in general!

@kfswain
Collaborator

kfswain commented Jan 15, 2025

WRT:

I think configurable filter chains are needed not only in the development stage, but also in production, since different environments need different filter chains.

That is very likely true, but we might want a more gentle approach. It's my supposition that the load balancing algo will not change, in production, on-the-fly. My justification is that there are going to be many workloads all running on the same pool, so altering the way they share resources in real time could lead to quite a bit of footshoot.

Which is why I think we should rely on flags passed at startup that specify a set of pre-determined algorithms, and/or a flag that lets someone create a configurable filter chain on the fly (we would have to mark that as "USE AT OWN RISK", as there is no way to test the combinatorial explosion of options that would be possible).

@kfswain
Collaborator

kfswain commented Jan 15, 2025

Some of this conversation is discussed here: #162

And here: https://docs.google.com/document/d/1RKVFAgerUoIMhClU9gRduXoF8Hq2tzzVOHdPvkQPKgs/edit?disco=AAABcAnDeEQ

Would love your insights there!

@liu-cong
Contributor

@Kuromesi Thanks for the detailed benchmarks and insights! I agree that we currently don't have a single algorithm that works the best for all environments, and therefore some configurability is likely desired. However I am very concerned about the configuration interface. IMO exposing the entire filter tree as a text format is very difficult to use, and error prone. And there is no good way to prevent regressions such as changing filter names.

I think the following can be explored:

  1. Make the algo simpler and smarter. There are some ideas to simplify the current filter tree implementation, such as Explore if we can simplify the filter tree in pkg/ext_proc/scheduling/scheduler.go #89. For example, if we know LoRA is not enabled, we will not use the LoRA related filters, which will significantly simplify the filters.
  2. Collect some use case and start exposing simple configuration knobs, such as the queue threshold.
  3. For advanced users, explore extension mechanisms such as allowing users to plug in their own scheduler.

Also cc @ahg-g @Jeffwan
