feature: implement replica groups in service configurations #3205
Conversation
type: service
port: 8000
commands: ["python app.py"]
replica_groups:
  - name: l40s-gpu
    replicas: 1..3  # autoscalable
    resources:
      gpu: L40S
  - name: h100-gpu
    replicas: 2  # fixed
    regions: [us-east]
    resources:
      gpu: H100
- Added the ability to define multiple replica groups with distinct configurations, including resource requirements and autoscaling behavior.
- Updated relevant documentation to reflect the new replica groups feature.
- Enhanced CLI output to display job plans with group names for better clarity.
- Ensured backward compatibility by excluding replica groups from JSON when not set.
- Added tests to validate the functionality and backward compatibility of replica groups.
This change allows for more flexible service configurations, enabling users to manage different types of resources and scaling strategies within a single service.
- Implemented migration for legacy jobs that lack a replica_group_name, ensuring they are correctly assigned to the appropriate replica groups.
- Updated CLI output to display group-specific properties such as spot policy, regions, and backends for better clarity.
- Enhanced tests to validate the migration process and ensure that jobs are correctly assigned to their respective groups.
- Improved handling of pool offers to accommodate multiple jobs in replica groups, ensuring all GPU types are considered.
This update improves the robustness of the service configuration and enhances user experience by providing clearer information in the CLI.
@DragonStuff, thanks for the PR. I'd like to first ensure we're on the same page regarding the replica groups use cases and how they integrate with other dstack features and UX.

In dstack, and typically in orchestration systems, service replicas are copies of the same app, possibly running in different regions/availability zones. I understand replica groups allow running replicas in different regions/availability zones, which is certainly needed, but we were thinking of supporting this in a simpler way, e.g. via a property like placement: spread.

Can you elaborate on why you'd want to run different replicas on different instance types, assuming replicas run the same app (i.e. same image, commands, etc.)? That's not clear to me. You write about limited capacity of instance types, but can't you just specify a list of gpus / instance types you're fine with and get the same effect? dstack will provision replicas on any of the specified instance types, e.g. one H100 and one RTX5090 if only one of each is available.

How do replica groups work with autoscaling? If multiple replica groups are autoscaled, how does dstack choose where to provision a new replica? Also, please clarify how replica groups work with rolling deployments.

If you're unsure about some design decisions, please let us know so we can discuss.
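For concreteness, something like this is what I have in mind; note that placement: spread is just a proposal here, not an existing dstack property:

type: service
port: 8000
commands: ["python app.py"]
replicas: 3
# proposed: spread replicas across regions/availability zones
placement: spread
resources:
  gpu: H100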
@r4victor Thank you for the questions!

Firstly, for capacity issues alone, you're right that just listing multiple gpus / instance types has the same effect. However, replica groups solve several additional use cases:

1. Different placement constraints per group. If you want to run fixed replicas in one region and autoscaling replicas in another, this is not currently possible. Example: always keep 1 replica in a specific region while the rest autoscale elsewhere.

2. Different autoscaling behavior per group, such as having some groups fixed (min = max) and others autoscalable (min < max). Example: 1 H100 always-on + 0-5 RTX5090 for overflow <-- this is how we are running our app right now for testing. The current single replicas range can't express this.

3. Different spot policies per group (there are probably more reasons, but these are the ones that come to mind): critical replicas on on-demand on certain providers (again, let's say there are 3 providers, and we have to match the config for each one, not just let the hunter pick everything for us), and overflow on spot (Vast.ai can't easily be mixed with Runpod).

When multiple replica groups are autoscaled, here's how it works. Current behavior:
Example:

replica_groups:
  - name: h100
    replicas: 1..3  # autoscalable
  - name: rtx5090
    replicas: 0..5  # autoscalable

If the autoscaler says "scale up by 2":
Rolling deployments work across all groups simultaneously.

We could also implement something like placement: spread for the simpler multi-region case, as you suggested. Both could coexist. What do you think? Sorry for the long message!
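For comparison, the capacity-only approach would be a single group-less config along these lines; I believe the gpu spec accepts a list of names, though treat the exact syntax below as an assumption:

type: service
port: 8000
commands: ["python app.py"]
replicas: 2
resources:
  gpu:
    name: [H100, RTX5090]  # provision whichever of these is available
    count: 1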
@DragonStuff, that clarifies a lot, thanks! The main question for me is whether we should support your use cases via replica groups or if there is a simpler way to do it. Replica groups would be a huge feature and will require a lot of maintenance and integration with other (new) features. For example, we'll add multi-node replicas to be able to run replicas that cannot fit on one node, so you'd have service -> multiple replica groups -> multiple replicas -> multiple replica jobs, which is quite a few layers.

I'm thinking about other orchestration systems and how you typically deploy such services. Take Kubernetes or ECS – they don't allow running different replicas with different resources, so you typically deploy different pods/tasks if you need different resources for replicas. What do you see as the main issues with doing the same in dstack, i.e. running multiple services vs one service if you need replicas configured differently?
I see what you mean, but I actually approach this from another perspective. A workload can run on any given processor if the resources value matches; all that matters is that you have CPU / memory limits set. Of course, when you start having nodes that, for instance, have local storage or some other unique characteristic, you want to taint / label them and assign workloads accordingly.

Maybe I'm alone in thinking this, but having a consumer LLM application is pretty much like having a Deployment in Kubernetes. You're probably going to want an HPA or some queue / request based scaling to control the number of replicas. However, ultimately, your cluster will probably be made up of Intel- and AMD-based nodes with different CPU and memory configs, but your app will "run the same" on each. Of course, when it comes to LLMs (just like any other app), you have constraints like:
This is an option, but then you have to make your application "fail over" between these endpoints instead of relying on the actual service with the proxy. Totally happy to consider another option as well.
This is very interesting, but I wonder how it would work in practice. Wouldn't this be transparent to the user? As an example, if a provider has 4xH100 and 1xH100 nodes, and you have set 5 replicas with 1xH100 each, wouldn't that start one 4xH100 node and one 1xH100 node? At least, that would be the Kubernetes way (with something like EC2 ASGs). I apologize, as I haven't experimented as much with this part of dstack.

If we wanted to follow the Kubernetes way as much as possible, introducing labels / taints would not be a bad idea, as we could achieve something similar to replica groups (essentially allowing you to taint replicas away from an instance based on a label).
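For reference, the Kubernetes mechanism I'm alluding to looks like this: label (and optionally taint) a node, then have the workload select and tolerate it. All names below are illustrative only:

# assuming the node was prepared with:
#   kubectl label node gpu-node-1 gpu-type=h100
#   kubectl taint node gpu-node-1 gpu-type=h100:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  nodeSelector:
    gpu-type: h100        # schedule only onto nodes labeled gpu-type=h100
  tolerations:
    - key: gpu-type       # tolerate the taint that repels other workloads
      operator: Equal
      value: h100
      effect: NoSchedule
  containers:
    - name: app
      image: example/llm-app:latest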
As an aside, right now there is a global Runpod outage caused by AWS's us-east-1 outage, which doesn't affect our production deployment running this branch because we are using both Vast.ai and Runpod. Looking at the logs, though, it might be worth implementing exponential backoff on the retries.
I see. To make it convenient to set up with dstack, it'd require separating the load balancing / router layer from services so that an LB can be configured against multiple services. Ideally that's the path I'd like dstack to take, but currently services are high-level, encapsulating both compute and load balancing, so, yeah, replica groups make sense.
By multi-node replicas I refer to a service where each replica has multiple jobs/nodes, e.g. each replica is 2 nodes of 8xH100. It can't be transparent because users need to configure what to run on the nodes. It can be similar to dstack multi-node tasks (you'd specify the number of nodes per replica).
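A rough sketch of what that might look like, borrowing the nodes property from dstack multi-node tasks; its use for services is speculative at this point:

type: service
port: 8000
commands: ["python serve.py"]
replicas: 2       # two replicas...
nodes: 2          # speculative: ...each spanning 2 nodes
resources:
  gpu: H100:8     # 8xH100 per node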
@DragonStuff, conceptually, I think we can make this PR work and introduce replica groups into dstack. Regarding the code, it requires some cleanup before we'll look into specifics. I'll leave a few comments. Overall, please take care of making the code free from AI bloat/verbosity so that it's easier to review and maintain.
if len(names) != len(set(names)):
    raise ValueError("Replica group names must be unique")
...
# Validate at least one group
There are many comments like that and they are redundant
if TYPE_CHECKING:
    from dstack._internal.core.models.configurations import ReplicaGroup, ServiceConfiguration
Why only for TYPE_CHECKING?
if not getattr(run_spec.configuration, "replica_groups", None):
    return
...
from dstack._internal.core.models.runs import get_normalized_replica_groups
Use top-level imports
selected_offers.append((group_name, offer))
remaining_slots -= take_count
...
# Second pass: Fill remaining slots with cheapest offers globally
If you feel the urge to sum up a piece of code with a comment, you might often refactor it into a function with a proper name for better readability.
@jvstme, what are your thoughts on how replica groups work with autoscaling and rolling deployments as proposed? |
I did some thinking after using LiteLLM over the last week and decided that maybe this would be better if we followed the pattern you suggested (having different services / runs for different compute) and instead added the concept of replica_groups to the gateway / model routing layer, similar to what LiteLLM does for load balancing: https://docs.litellm.ai/docs/routing

Essentially we would define aliases, perhaps in the "primary" service file (see the sketch below). I think it would make the implementation much cleaner and introduce the same feature without complicating future functionality. Of course, we can also change the name to reflect the different functionality. What do you think @r4victor?
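A hypothetical sketch of the alias idea; the alias property below does not exist in dstack and is only meant to show the shape of the proposal:

# h100-service.dstack.yml
type: service
name: llm-h100
port: 8000
commands: ["python app.py"]
alias: my-llm      # hypothetical: gateway routes all services sharing this alias
resources:
  gpu: H100

# rtx5090-service.dstack.yml would declare the same alias, and the
# gateway / model routing layer would load-balance "my-llm" across both,
# similar to LiteLLM's model aliases.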
@DragonStuff, yes, I think this is a much better approach – it keeps services simpler and more flexible at the same time. It may be possible to support this with just an alias property in service configurations, as you suggest, so that all services that define the same alias are routed via the same endpoint. But one may also need to configure per-router settings, and for that I think we might need a separate configuration, like a router configuration. We've actually planned to implement router functionality primarily to be able to route between both self-hosted and proprietary models: #1631. If we implement it for dstack services, it should help with your use cases. One caveat is that we have an ongoing PR that also adds sglang router functionality to the gateway, so the term "router" becomes overloaded and we may need to choose a different one.
Please review thoroughly as this is my first application-level dstack change. I apologize for not discussing it first with the maintainers, but I was inspired and this solves a real problem!

This solves a problem that dynamic providers such as Vast.ai have: they have completely different configs available at any given time, and you are essentially stuck with deploying different tasks/services or accepting some level of unreliability when it comes to what machines it finds. As an example, if you want an H100 (because there is only one available) and an RTX5090, or if you want to split multi-region, etc., you basically can't do this today within the same config.
This PR adds the concept of Replica Groups, enabling heterogeneous instance types within a single service with group-aware autoscaling. I made sure to keep and test backwards compatibility on both the client and server.
Example config: (the YAML shown at the top of this page)
Which yields:
Essentially, from a config perspective, this does the following:
- Added a ReplicaGroup model that inherits from ProfileParams:
  - name: unique group identifier
  - replicas: range (min..max) for autoscaling, or a fixed count
  - resources: group-specific GPU/CPU/memory requirements
  - ProfileParams fields: backends, regions, instance_types, spot_policy, etc.
- Updated ServiceConfigurationParams:
  - added the replica_groups field (mutually exclusive with replicas)
  - kept the existing replicas field for backward compatibility

There was a terrible bug when migrating from replicas to replica groups where it would try and pick a None (essentially any available machine) due to the previous jobs. _migrate_legacy_job_replica_groups was added to patch this. There might be a more elegant way to do this.

The documentation and contribution guide were also updated, but please let me know if you'd like me to add more / less detail. Similarly, I added as many tests as I could whenever I encountered something novel in the code path.