There is currently a mismatch between what our configurations allow and what our backends support regarding dynamic model loading:
- VLLM Backend: Model weights are loaded to the GPU at startup. It is not possible for a client to connect and specify a new base model on the fly.
- Engine Backend: It is technically possible to dynamically load a new base model.
Our code currently allows clients to specify a new base model regardless of the backend. If VLLM is running, the code still attempts to sample against the requested model, even though VLLM cannot load it. I think we should state explicitly what we expect from our architecture. Once that choice is made, we can reduce complexity elsewhere.
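
One way to make that expectation explicit would be a capability check at the point where a client's model request is resolved. The sketch below is only an illustration of the idea: `BackendCapabilities`, `resolve_base_model`, and `UnsupportedModelError` are hypothetical names rather than existing code, and the model names are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


class UnsupportedModelError(Exception):
    """Raised when a client requests a base model the backend cannot serve."""


@dataclass(frozen=True)
class BackendCapabilities:
    # Hypothetical descriptor: each backend declares up front what it can do.
    name: str
    supports_dynamic_base_model: bool
    loaded_base_model: str


def resolve_base_model(caps: BackendCapabilities, requested: Optional[str]) -> str:
    """Return the base model to sample against, or fail fast with a clear error."""
    # No explicit request, or a request for the already-loaded model: nothing to load.
    if requested is None or requested == caps.loaded_base_model:
        return caps.loaded_base_model
    # The backend declared that it can load a different base model on demand.
    if caps.supports_dynamic_base_model:
        return requested
    # Otherwise reject the request instead of sampling against a model that is not loaded.
    raise UnsupportedModelError(
        f"{caps.name} backend serves only '{caps.loaded_base_model}'; "
        f"cannot switch to '{requested}' at runtime"
    )


# Placeholder configurations mirroring the two backends described above.
vllm_caps = BackendCapabilities("vllm", False, "model-a")
engine_caps = BackendCapabilities("engine", True, "model-a")
```

With a check like this, a request for `model-b` against the VLLM configuration fails immediately with a descriptive error, while the same request against the Engine configuration is passed through. Whether we keep dynamic loading at all is the architectural decision; the check just makes whichever choice we pick visible in one place.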