In the README you said you are mainly targeting iOS/Android.
What about macOS, Windows, and Linux?
I am building a single chat app that runs on both mobile and desktop, so support for all of these platforms is a key part of that.
Additionally, I would love native, built-in support for embedding models, which are a key part of many RAG pipelines.
I would also like to see support for MTP (multi-token prediction), which was recently added to llama.cpp. It can speed up generation through speculative decoding, which could be a great help on mobile.
Another thing I would really love is native support for batching multiple sequences at once for higher parallelism, and potentially (though this is less urgent) server-style continuous batching.
I believe most of these features are already supported in llama.cpp itself and just need to be exposed and optimized in this library.