mlx_lm: Add Streaming Capability to Generate Function #807
Conversation
Just curious about this. If you want a streaming generator you could use generate_step. I think it might be better, rather than having two generators, to figure out the right interface for the generator we already have (generate_step).
@awni can you please recommend a way to support streaming without needing to patch mlx_lm? I started this PR, which substitutes bitsandbytes for MLX; however, it is not working with streaming. lllyasviel/Omost#54
Yes, I considered this. The second issue, as you already mentioned, is that you then have to handle tokenization and detokenization yourself. You also need to add the end-of-sequence check. In my opinion, it's a bit too much overhead on the end user's side to enable streaming this way. It should be as easy to enable as in the OpenAI API: via a parameter passed to the method, or just a separate function. I know that my solution is not perfect and it might be refactored. Here is the comparison between generate_step and the PR's solution from the end user's side:

```python
import mlx.core as mx

prompt_tokens = mx.array(tokenizer.encode(prompt))
detokenizer = tokenizer.detokenizer
detokenizer.reset()

for token in generate_step(
    prompt_tokens,
    model,
    temp,
    repetition_penalty,
    repetition_context_size,
    top_p,
    logit_bias,
):
    if token == tokenizer.eos_token_id:
        break
    detokenizer.add_token(token)
    yield detokenizer.last_segment
```

vs.

```python
response_generator = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=max_tokens,
    temp=temp,
    streaming=True,
)

for token in response_generator:
    yield token
```
Yes, this makes sense. However, the …
I modified the code to have two functions; here is an example using stream_generate:

```python
from mlx_lm import load, stream_generate

repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)

prompt = "Write a story about Einstein"

for t in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(t, end="", flush=True)
print()
```

I think there is some opportunity to refactor …
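One possible shape for such a refactor is to build the non-streaming path on top of the streaming one. This is only a sketch under that assumption; generate_text is a hypothetical name, not part of mlx_lm:

```python
from mlx_lm import load, stream_generate

def generate_text(model, tokenizer, prompt, **kwargs):
    """Hypothetical non-streaming wrapper: join the streamed text segments."""
    return "".join(stream_generate(model, tokenizer, prompt, **kwargs))

# Usage with the same model as above:
# model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
# print(generate_text(model, tokenizer, "Write a story about Einstein", max_tokens=512))
```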
@dfl the …
Thanks for the addition!
@awni I am trying to match the transformers API, which uses TextIteratorStreamer (as well as AutoModelForCausalLM and AutoTokenizer). I guess the APIs are too different and it's not going to work so easily. AutoModelForCausalLM.generate and mlx_lm's generate do not have matching arguments, so I need some wrapper function... and a lambda to run it in a Thread for Gradio -- which seems beyond my current Python abilities 🫤
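Since stream_generate already returns a plain Python generator, the TextIteratorStreamer plus Thread pattern from transformers may not be needed: Gradio can consume a generator function directly. A minimal sketch, assuming mlx_lm >= 0.14.3 and a Gradio version that streams generator outputs (the respond function is illustrative, not from either library):

```python
import gradio as gr
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

def respond(prompt):
    # Accumulate segments and yield the partial text so Gradio can
    # re-render the output box as tokens arrive.
    text = ""
    for segment in stream_generate(model, tokenizer, prompt, max_tokens=512):
        text += segment
        yield text

demo = gr.Interface(fn=respond, inputs="text", outputs="text")
demo.launch()
```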
Sorry, I don't understand... how do I get these latest features in my pip install?
@dfl I just did a patch release, so you can get 0.14.3, which should have stream_generate.
Add streaming feature to the text generation function generate. The logic is refactored into an internal _generate helper that handles both streaming and non-streaming text generation modes, and the generate function is modified to return a generator when streaming is enabled, and the full generated text when streaming is disabled.
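A rough illustration of that interface follows. It is a sketch of the described behavior rather than the PR's actual code, and the _generate body is stubbed out by delegating to stream_generate so the example stays runnable:

```python
from mlx_lm import stream_generate

def _generate(model, tokenizer, prompt, max_tokens=256, **kwargs):
    # Stub: the PR implements the token loop (generate_step + detokenizer)
    # directly; delegating to stream_generate keeps this sketch runnable.
    yield from stream_generate(model, tokenizer, prompt, max_tokens=max_tokens, **kwargs)

def generate(model, tokenizer, prompt, max_tokens=256, streaming=False, **kwargs):
    segments = _generate(model, tokenizer, prompt, max_tokens=max_tokens, **kwargs)
    if streaming:
        return segments           # generator of text segments
    return "".join(segments)      # full generated text
```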