
Conversation

@DragonStuff
Contributor

@DragonStuff DragonStuff commented Oct 18, 2025

Please review thoroughly as this is my first application-level dstack change. I apologize for not discussing it first with the maintainers, but I was inspired and this solves a real problem!

This solves a problem with dynamic providers such as vast.ai: the set of available configurations changes constantly, so today you are stuck either deploying separate tasks/services or accepting some unreliability in which machines you get. For example, if you want one H100 (because only one is available) plus an RTX5090, or if you want to split replicas across regions, you basically can't do this within a single config.

This PR adds the concept of Replica Groups, enabling heterogeneous instance types within a single service with group-aware autoscaling. I made sure to keep and test backwards compatibility on both the client and server.

Example config:

regions: 
  - jp-japan

# Define replica groups with different GPU types
replica_groups:
  - name: l40s-gpu
    replicas: 1
    resources:
      gpu:
        name: L40S
        count: 1
  
  - name: a100-gpu
    replicas: 1
    resources:
      gpu:
        name: A100
        count: 1

Which yields:

Project          main                                                
 User             admin                                               
 Configuration    .dstack.yml                                         
 Type             service                                             
 Replica groups   l40s-gpu ×1 (cpu=2.. mem=8GB.. disk=100GB.. L40S:1) 
                  a100-gpu ×1 (cpu=2.. mem=8GB.. disk=100GB.. A100:1) 
 Spot policy      auto                                                
 Max price        -                                                   
 Retry policy     1d[no-capacity, error, interruption]                
 Creation policy  reuse-or-create                                     
 Idle duration    5m                                                  
 Max duration     -                                                   
 Reservation      -                                                   

 #  BACKEND                      RESOURCES                                                               INSTANCE TYPE  PRICE         
    a100-gpu:                    No matching instance offers available.                                                               
                                 Possible reasons:                                                                                    
                                 https://dstack.ai/docs/guides/troubleshooting/#no-offers                                             
 1  l40s-gpu: vastai (jp-japan)  cpu=32 mem=32GB disk=100GB L40S:48GB:1                                  24964757       $0.4852       
 2  l40s-gpu: vastai (jp-japan)  cpu=32 mem=32GB disk=100GB L40S:48GB:1                                  24964762       $0.4852  busy 
 3  l40s-gpu: vastai (jp-japan)  cpu=32 mem=32GB disk=100GB L40S:48GB:1                                  24964753       $0.6185      

Essentially, from a config perspective, this does the following:

  • Added ReplicaGroup model that inherits from ProfileParams

    • name: Unique group identifier
    • replicas: Range (min..max) for autoscaling or fixed count
    • resources: Group-specific GPU/CPU/memory requirements
    • All ProfileParams fields: backends, regions, instance_types, spot_policy, etc.
  • Updated ServiceConfigurationParams:

    • Added replica_groups field (mutually exclusive with replicas)
    • Validation: unique names, non-empty list, range requires scaling
    • Backward compatible with existing replicas field
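As a rough stdlib-only sketch of the validation rules above (the actual PR uses dstack's pydantic models; the names and shapes here are simplified assumptions):

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class ReplicaGroup:
    # Hypothetical simplified model; the PR's ReplicaGroup inherits ProfileParams
    name: str
    replicas: Union[int, Tuple[int, int]]  # fixed count or (min, max) range
    resources: Optional[dict] = None


def validate_replica_groups(groups, scaling_configured: bool):
    """Mirror the validation rules described above: non-empty list,
    unique names, and a replicas range requires scaling to be configured."""
    if not groups:
        raise ValueError("replica_groups must contain at least one group")
    names = [g.name for g in groups]
    if len(names) != len(set(names)):
        raise ValueError("Replica group names must be unique")
    for g in groups:
        is_range = isinstance(g.replicas, tuple) and g.replicas[0] != g.replicas[1]
        if is_range and not scaling_configured:
            raise ValueError(
                f"Group {g.name!r} uses a replica range but no scaling is configured"
            )
```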

There was a nasty bug when migrating from replicas to replica groups: because of pre-existing jobs, provisioning would pick a None group (essentially any available machine). _migrate_legacy_job_replica_groups was added to patch this; there may be a more elegant way to do it.
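A minimal sketch of what such a migration might look like, assuming jobs and groups are plain dicts (the actual _migrate_legacy_job_replica_groups operates on dstack's models, so the field names here are illustrative):

```python
def migrate_legacy_jobs(jobs, groups):
    """Assign legacy jobs lacking a replica_group_name to a matching group,
    so provisioning doesn't fall back to 'any available machine'."""
    default = groups[0]["name"] if groups else None
    for job in jobs:
        if job.get("replica_group_name") is None:
            # Prefer a group whose GPU matches the job's current GPU,
            # otherwise fall back to the first group
            match = next(
                (g["name"] for g in groups if g.get("gpu") == job.get("gpu")),
                default,
            )
            job["replica_group_name"] = match
    return jobs
```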

The documentation and contribution guide were also updated, but please let me know if you'd like more or less detail. Similarly, I added tests wherever I encountered something novel in the code path.

type: service
port: 8000
commands: ["python app.py"]
replica_groups:
  - name: l40s-gpu
    replicas: 1..3  # autoscalable
    resources:
      gpu: L40S

  - name: h100-gpu
    replicas: 2  # fixed
    regions: [us-east]
    resources:
      gpu: H100

- Added the ability to define multiple replica groups with distinct configurations, including resource requirements and autoscaling behavior.
- Updated relevant documentation to reflect the new replica groups feature.
- Enhanced CLI output to display job plans with group names for better clarity.
- Ensured backward compatibility by excluding replica groups from JSON when not set.
- Added tests to validate the functionality and backward compatibility of replica groups.
- Implemented migration for legacy jobs that lack a replica_group_name, ensuring they are correctly assigned to the appropriate replica groups.
- Updated CLI output to display group-specific properties such as spot policy, regions, and backends.
- Enhanced tests to validate the migration process and ensure that jobs are assigned to their respective groups.
- Improved handling of pool offers to accommodate multiple jobs in replica groups, ensuring all GPU types are considered.

Overall, this change allows for more flexible service configurations, enabling users to manage different resource types and scaling strategies within a single service, and makes the CLI output clearer.
@r4victor
Collaborator

@DragonStuff, thanks for the PR. I'd like to first ensure we're on the same page regarding the replica groups use cases and how they integrate with other dstack features and UX.

In dstack, as in most orchestration systems, service replicas are copies of the same app, possibly running in different regions/availability zones. I understand replica groups allow running replicas in different regions/availability zones, which is certainly needed, but we were considering supporting that in a simpler way, e.g. via a property like placement: spread.

Can you elaborate on why you'd want to run different replicas on different instance types, assuming replicas run the same app (i.e. same image, commands, etc.)? That's not clear to me. You mention limited capacity of instance types, but can't you just specify a list of gpus / instance types you're fine with and get the same effect? dstack will provision replicas on any of the specified instance types, e.g. one H100 and one RTX5090 if only one of each is available.

How do replica groups work with autoscaling? If multiple replica groups are autoscaled, how does dstack choose where to provision a new replica?

Also please clarify how replica groups work with rolling deployments.

If you're unsure about some design decisions, please let us know so we can discuss.

@DragonStuff
Contributor Author

@r4victor Thank you for the questions!

Firstly, for capacity issues alone, just listing multiple instance types (e.g., gpu: [H100, RTX5090]) would work if I were using a single provider, or if each provider's workload could be separated.

However, replica groups solve several additional use cases:

Firstly, different placement constraints per group. If you want fixed replicas in one region and autoscaling replicas in another, that is not currently possible. Example: always keep 1 replica in ap-northeast-1 (low latency for most users) and autoscale overflow into cheaper regions. Unless my understanding is incorrect, we also can't achieve this with placement: spread alone, since we want deterministic placement per group.

Secondly, different autoscaling behavior per group: some groups fixed (min=max), others autoscalable (min < max). Example: 1 H100 always-on + 0-5 RTX5090 for overflow -- this is how we are running our app right now for testing. The current replicas field applies uniform scaling to all replicas.

Lastly (there are probably more reasons, but these come to mind), different spot policies per group: critical replicas on-demand on certain providers (again, say there are 3 providers and we must match the config for each one rather than letting the hunter pick everything for us), overflow on spot (Vast.ai can't easily be mixed with Runpod).
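For illustration, a config combining these per-group settings under the PR's proposed schema might look like this (group names and values are made up; per-group spot_policy and regions come from ReplicaGroup inheriting ProfileParams):

```yaml
type: service
port: 8000
commands: ["python app.py"]

replica_groups:
  # Critical capacity: on-demand, pinned to a region
  - name: critical
    replicas: 1
    regions: [ap-northeast-1]
    spot_policy: on-demand
    resources:
      gpu: H100

  # Overflow capacity: spot, autoscaled
  - name: overflow
    replicas: 0..5
    spot_policy: spot
    resources:
      gpu: RTX5090
```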

When multiple replica groups are autoscaled, here's how it works:

Current behavior:

  1. The autoscaler (e.g., RPSAutoscaler) calculates a single replica delta for the entire service (e.g., "scale up by 2")
  2. scale_run_replicas() distributes this across groups:
    • Priority 1: Groups below minimum (scale regardless of autoscalability)
    • Priority 2: Autoscalable groups (min != max) not yet at max
    • Priority 3: During rolling deployment, even fixed groups can exceed max temporarily
  3. Pick the first eligible group for each new replica (lines 1539-1541 in runs.py).

Example:

replica_groups:
  - name: h100
    replicas: 1..3      # Autoscalable
  - name: rtx5090
    replicas: 0..5      # Autoscalable

If autoscaler says "scale up by 2":

  • Currently: Both replicas go to h100 (first eligible group)
  • Future improvement: Could round-robin or use cost-based selection -- I think cost-based makes the most sense but I chose the first one that matched for simplicity in this initial version (runs.py:1488-1577).
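The priority order above can be sketched like this (the group shape is a hypothetical simplification of the logic in runs.py, not dstack's actual models):

```python
def pick_group_for_new_replica(groups, deployment_in_progress=False):
    """Pick the first eligible group for a new replica.
    Each group is a dict: {"name", "current", "min", "max"}."""
    # Priority 1: groups below their minimum (scale regardless of autoscalability)
    for g in groups:
        if g["current"] < g["min"]:
            return g["name"]
    # Priority 2: autoscalable groups (min != max) not yet at max
    for g in groups:
        if g["min"] != g["max"] and g["current"] < g["max"]:
            return g["name"]
    # Priority 3: during a rolling deployment, even fixed groups may exceed max
    if deployment_in_progress and groups:
        return groups[0]["name"]
    return None
```

With the h100 1..3 / rtx5090 0..5 example above, both new replicas land on h100 because it is the first autoscalable group below max.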

Rolling deployments work across all groups simultaneously (process_runs.py:483-514):

  1. Scale up: Create new replicas (deployment_num + 1) for each group, temporarily exceeding max

    • Uses allow_exceeding_max=True parameter
    • Each group can have up to max + 1 replicas during deployment
  2. Wait: New replicas start and register with gateway

  3. Scale down: Terminate old replicas (out-of-date deployment_num)

    • Prioritizes terminating out-of-date replicas even from fixed groups (lines 1449-1459)
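One rolling step for a single group might be sketched like this (hypothetical dict shapes, not dstack's actual models):

```python
def rolling_deploy_step(group, new_deployment_num):
    """One iteration of the rolling deployment described above.
    A group is a dict with a 'replicas' list of {'deployment_num': int}."""
    old = [r for r in group["replicas"] if r["deployment_num"] < new_deployment_num]
    if not old:
        return False  # group is fully up to date
    # Phase 1: start one new replica, temporarily exceeding max
    # (allow_exceeding_max=True in the PR)
    group["replicas"].append({"deployment_num": new_deployment_num})
    # Phase 2 (elided here): wait for it to register with the gateway
    # Phase 3: terminate one out-of-date replica, even in a fixed group
    group["replicas"].remove(old[0])
    return True
```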

We could implement placement: spread as a complementary feature:

  • placement: spread → dstack automatically distributes replicas across available regions
  • replica_groups → advanced heterogeneous configurations

Both could coexist. What do you think? Sorry for the long message!

@r4victor
Collaborator

@DragonStuff, that clarifies a lot, thanks! The main question for me is whether we should support your use cases via replica groups or if there is a simpler way to support your use cases. Replica groups would be a huge feature and will require a lot of maintenance and integration with other (new) features. For example, we'll add multi-node replicas to be able to run replicas that cannot fit on one node, so you'll have service->multiple replica groups->multiple replicas->multiple replica jobs, which is quite a few.

I'm thinking about other orchestration systems and how you typically deploy such services. Take Kubernetes or ECS – they don't allow running different replicas with different resources, so you typically deploy different pods/tasks if you need different resources for replicas. What do you see as the main issues with doing the same in dstack, i.e. running multiple services instead of one service if you need replicas configured differently?

@DragonStuff
Contributor Author

DragonStuff commented Oct 20, 2025

I'm thinking about other orchestration systems and how you typically deploy such services. Take Kubernetes or ECS – they don't allow running different replicas with different resources, so you typically deploy different pods/tasks if you need different resources for replicas. What do you see as the main issues with doing the same in dstack, i.e. running multiple services instead of one service if you need replicas configured differently?

I see what you mean; but I actually approach this from another perspective. A workload can run on any given processor if the resources value matches. All that matters is that you have CPU / memory limits set. Of course, when you start having nodes that for instance have local storage or some other unique characteristic, you want to taint / label them and assign workloads.

Maybe I'm alone in thinking this, but having a consumer LLM application is pretty much like having a deployment in Kubernetes. You're probably going to want a HPA or some queue / request based scaling to control the number of replicas. However, ultimately, your cluster will probably be made up of Intel and AMD based nodes with different CPU and memory configs. But your app will "run the same" on each.

Of course, when it comes to LLMs (just like any other app), you have constraints like:

  • I want to have multiple AZs (in this case it would most likely be providers)
  • I don't want to run it on a node in a different region (latency)
  • Get the best price-performance (cost-effectiveness)

running multiple services vs one service if you need different replicas configured differently.

This is an option, but then you have to make your application "fail over" between these endpoints instead of the actual service with the proxy (dstack) transparently providing the OpenAI compatible endpoint.

Totally happy to consider another option as well.

For example, we'll add multi-node replicas to be able to run replicas that cannot fit on one node, so you'll have service->multiple replica groups->multiple replicas->multiple replica jobs, which is quite a few.

This is very interesting, but I wonder how it would work in practice? Wouldn't this be transparent to the user? As an example if a provider has 4xH100 and 1xH100 nodes, and you have set 5 replicas with 1xH100 each, wouldn't that start one 4xH100, and one 1xH100 node? At least, that would be the Kubernetes way (with something like EC2 ASGs). I apologize as I haven't experimented as much with this part of dstack yet.

If we wanted to follow the Kubernetes way as much as possible, introducing labels / taints would not be a bad idea, as we could achieve something similar to replica groups (essentially allowing you to taint replicas away from an instance based on a label like hostname). Of course, this creates more complexity, since every provider has dynamic names for every node, some providers lack the needed metadata, and so on. It's also much harder than just defining and getting what you ask for. I can see upsides and downsides to each approach for sure.

@DragonStuff
Contributor Author

As an aside, right now there is a global Runpod outage caused by AWS's us-east-1 outage, which doesn't affect our production deploy running this branch because we are using both Vast.ai and Runpod. Looking at the logs though, it might be worth implementing exponential backoff on the retry.
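A generic backoff helper along those lines might look like this (not dstack's actual retry API, just a sketch with jittered exponential delays):

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, cap=60.0):
    """Call fn(), retrying on failure with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error
            # Delay doubles each attempt, capped, with random jitter to
            # avoid synchronized retries against a recovering provider
            delay = min(cap, base_delay * 2**attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))
```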

@r4victor
Collaborator

@DragonStuff,

This is an option, but then you have to make your application "fail over" between these endpoints instead of the actual service with the proxy (dstack) transparently providing the OpenAI compatible endpoint.

I see. To make this convenient to set up with dstack, it'd require separating the load balancing / router layer from services, so that an LB can be configured against multiple services. Ideally that's the path I'd like dstack to take, but currently services are high-level, encapsulating both compute and load balancing, so, yeah, replica groups make sense.

This is very interesting, but I wonder how it would work in practice? Wouldn't this be transparent to the user?

By multi-node replicas I mean a service where each replica spans multiple jobs/nodes, e.g. each replica is 2 nodes of 8xH100. It can't be transparent because users need to configure what to run on the nodes. It can be similar to dstack multi-node tasks (you'd specify the number of nodes within a replica), but you'd also specify the number of replicas as you currently do for services. If you set 5 replicas with 1xH100 each, that's not multi-node replicas.

@r4victor
Collaborator

@DragonStuff, conceptually, I think we can make this PR work and introduce replica groups into dstack.

Regarding the code, it requires some cleanup before we look into specifics. I'll leave a few comments. Overall, please make the code free of AI bloat/verbosity so that it's easier to review and maintain.

if len(names) != len(set(names)):
    raise ValueError("Replica group names must be unique")

# Validate at least one group
Collaborator


There are many comments like that and they are redundant

Comment on lines +9 to +10
if TYPE_CHECKING:
from dstack._internal.core.models.configurations import ReplicaGroup, ServiceConfiguration
Collaborator


Why only for TYPE_CHECKING

if not getattr(run_spec.configuration, "replica_groups", None):
    return

from dstack._internal.core.models.runs import get_normalized_replica_groups
Collaborator


Use top-level imports

selected_offers.append((group_name, offer))
remaining_slots -= take_count

# Second pass: Fill remaining slots with cheapest offers globally
Collaborator


If you feel the urge to sum up a piece of code with a comment, you might often refactor it into a function with a proper name for better readability.

@r4victor
Collaborator

@jvstme, what are your thoughts on how replica groups work with autoscaling and rolling deployments as proposed?

@DragonStuff
Contributor Author

DragonStuff commented Oct 29, 2025

I did some thinking after using LiteLLM over the last week and decided this might be better if we followed the pattern you suggested (different services/runs for different compute) and instead added the concept of replica_groups to the gateway / model routing layer, similar to what LiteLLM does for load balancing.

https://docs.litellm.ai/docs/routing

Essentially we would define aliases like the above, perhaps in the "primary" service file?

I think it would make the implementation much cleaner and introduce the same feature without complicating future functionality. Of course we can also change the name to reflect the different functionality.

What do you think @r4victor?

@r4victor
Collaborator

@DragonStuff, yes, I think this is a much better approach – it keeps services simpler and more flexible at the same time. It may be possible to support this with just an alias property in service configurations, as you suggest, so that all services defining the same alias are routed via the same endpoint. But one may also need to configure some per-router settings, and for that I think we'd need a separate configuration, like a router configuration.

We've actually planned to implement a router functionality primarily to be able to route between both self-hosted and proprietary models: #1631. If we implement it for dstack services, it should help with your use cases.

One caveat is that we have an ongoing PR that also adds sglang router functionality to the gateway, so the term router becomes overloaded and we may need to choose a different one.
