Even if this is limited to CUDA, as long as it can be maintained without major issues, it would be a massive quality-of-life improvement for those who do not have beefy hardware.
Thank you for your time and consideration.
Motivation
This would allow users without high-end hardware to run longer context lengths and get better performance from MoE models. Because the KV cache would need less memory, more layers could be offloaded onto the GPU.