Add Qwen 2.5 #2088
base: master
Conversation
Thanks for the PR! Before review, could you please do a forward pass and match the output with HF's Qwen? Also, let's make it a draft PR till then.
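For reference, a minimal sketch of this kind of forward-pass comparison. The HF checkpoint name is illustrative, and `keras_backbone` stands in for the Qwen2Backbone built from the converted weights inside the conversion script; this is not the script's actual code.

```python
import keras
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_name = "Qwen/Qwen2.5-0.5B"  # illustrative checkpoint name
hf_model = AutoModelForCausalLM.from_pretrained(hf_name, torch_dtype=torch.float32)
hf_tokenizer = AutoTokenizer.from_pretrained(hf_name)

token_ids = hf_tokenizer("What is Keras?", return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Hidden states from the HF decoder stack (before the LM head).
    hf_hidden = hf_model.model(token_ids).last_hidden_state.numpy()

# `keras_backbone` stands in for the Qwen2Backbone built from the
# converted weights inside the conversion script.
keras_hidden = keras_backbone(
    {
        "token_ids": token_ids.numpy(),
        "padding_mask": np.ones_like(token_ids.numpy(), dtype=bool),
    }
)

# Loose fp32 check at atol=1e-3; tightening this toward 1e-5 comes up below.
np.testing.assert_allclose(
    keras.ops.convert_to_numpy(keras_hidden), hf_hidden, atol=1e-3
)
```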
Took a cursory glance. Let's do the weight conversion and numerics check first!
Pushed a fix for the code format error.
@shivance - let us know when this PR is ready for review. Thanks!
@abheesht17 I have got the tokenizer working; currently I am working on matching the output of the HF model and the Keras model.
Great, no hurry. Was just checking. Do ping if you hit any blockers :)
@abheesht17 I see that the newer checkpoint conversion script uses a different assignment method instead of the old kernel assign. Has the API changed for assigning the bias as well? Why was the new method created, and what is the difference?
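The snippet being referenced above isn't visible in this thread, so purely as a general illustration (not the specific API the conversion script uses), here are two equivalent ways of assigning converted weights to a Keras layer; bias variables are handled the same way as kernels.

```python
import numpy as np
import keras

layer = keras.layers.Dense(4)
layer.build((None, 8))  # creates kernel of shape (8, 4) and bias of shape (4,)

new_kernel = np.zeros((8, 4), dtype="float32")
new_bias = np.zeros((4,), dtype="float32")

# Per-variable style: assign each variable directly.
layer.kernel.assign(new_kernel)
layer.bias.assign(new_bias)

# Equivalent bulk form: set all of a layer's weights at once, in build order.
layer.set_weights([new_kernel, new_bias])
```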
@abheesht17 Upon weight loading, the outputs look like this! The comparison succeeds at an absolute tolerance of 1e-3. I am testing at fp32, since it's a 0.5B model.
@abheesht17 I have marked this PR as ready for review.
Great. Were you able to bring the difference in numerics down to 1e-5? Might be worth checking layer-by-layer which one's causing an issue.
Edit: never mind, the conversion script is part of the PR. |
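A sketch of the layer-by-layer check suggested above, assuming both sides expose per-layer activations (on the HF side via `output_hidden_states=True`, on the Keras side by calling the backbone's `transformer_layers` one at a time); the helper name is hypothetical.

```python
import numpy as np

def first_divergent_layer(keras_acts, hf_acts, atol=1e-5):
    """Compare per-layer activations and report where the two stacks diverge.

    Both arguments are lists of numpy arrays, one entry per decoder layer,
    collected from the same token ids.
    """
    for i, (k, h) in enumerate(zip(keras_acts, hf_acts)):
        max_diff = np.abs(k - h).max()
        if not np.allclose(k, h, atol=atol):
            print(f"Layer {i}: max abs diff = {max_diff:.3e}")
            return i
    print("All layers match within tolerance.")
    return None
```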
@abheesht17 Here is the Colab version of the conversion script.
@abheesht17 did you get a chance to inspect the delta in output?
Thanks! Just some initial comments and questions.
@@ -77,7 +77,9 @@ def build(self, input_shape):
        # Defer packer creation to `build()` so that we can be sure tokenizer
        # assets have loaded when restoring a saved model.
        self.packer = StartEndPacker(
            start_value=self.tokenizer.start_token_id,
            start_value=self.tokenizer.start_token_id
Why do we need this? We pass add_start_value=self.add_start_token below when we call the layer. It seems simpler to configure the layer so the packer always knows the start value. And if a user were calling the packer directly, they could just pass add_start_token=True during the call.
If you check the Qwen tokenizer config, it doesn't have a BOS token. So the StartEndPacker setup throws an exception when it tries to access start_token_id, since the attribute isn't even there.
stacktrace:
> Keras 3 model and tokenizer loaded.
Traceback (most recent call last):
File "/Users/flip/Desktop/Projects/keras-hub/tools/checkpoint_conversion/convert_qwen_checkpoints.py", line 307, in <module>
app.run(main)
File "/Users/flip/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/Users/flip/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
^^^^^^^^^^
File "/Users/flip/Desktop/Projects/keras-hub/tools/checkpoint_conversion/convert_qwen_checkpoints.py", line 293, in main
test_tokenizer(keras_hub_tokenizer, hf_tokenizer)
File "/Users/flip/Desktop/Projects/keras-hub/tools/checkpoint_conversion/convert_qwen_checkpoints.py", line 234, in test_tokenizer
keras_hub_output = keras_hub_preprocessor(
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/flip/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/Users/flip/Desktop/Projects/keras-hub/keras_hub/src/models/causal_lm_preprocessor.py", line 80, in build
start_value=self.tokenizer.start_token_id,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/flip/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
raise AttributeError(
AttributeError: 'Qwen2Tokenizer' object has no attribute 'start_token_id'. Did you mean: 'pad_token_id'?
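One way to make the packer tolerant of tokenizers that define no BOS token (a sketch only, not necessarily the change this PR makes). Here `tokenizer` stands in for the Qwen2 tokenizer instance, and the end_token_id/pad_token_id attribute names follow the usual keras-hub tokenizer conventions.

```python
from keras_hub.layers import StartEndPacker

# Qwen defines no BOS token, so fall back to None instead of raising
# AttributeError when `start_token_id` is absent.
start_value = getattr(tokenizer, "start_token_id", None)

packer = StartEndPacker(
    sequence_length=512,
    start_value=start_value,  # None means "do not add a start token"
    end_value=tokenizer.end_token_id,
    pad_value=tokenizer.pad_token_id,
    return_padding_mask=True,
)
```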
@keras_hub_export("keras_hub.models.Qwen2Backbone")
class Qwen2Backbone(Backbone):
    """
    #TODO:
Let's add this in before merge! Even just a one-liner: "The Qwen2 decoder network."
done!
@keras_hub_export("keras_hub.models.Qwen2Backbone")
class Qwen2Backbone(Backbone):
How different is Qwen 1 from Qwen 2, btw?
There is no difference between the Qwen series models.
    misc_special_tokens -= {eos_token}

    # Add misc special tokens
    for i, token in enumerate(misc_special_tokens):
What are these used for? I don't see them used anywhere. A lot of tokenizers have reserved and unused tokens (e.g., for BERT the first thousand or so, I think); we don't generally give them special treatment.
I just followed the Llama 3 tokenizer!
        self._add_special_token(token, f"special_token_{i:03d}")
        special_tokens.add(token)

    # Add alternate EOS token if needed
When is this needed? And why?
my bad.
    rope_max_wavelength=hf_model.config.rope_theta,
    use_sliding_window=hf_model.config.use_sliding_window,
    sliding_window_size=hf_model.config.sliding_window,
    # dtype="bfloat16"
Are we saving at full precision? We should probably save in the same dtype we are converting from. If we are taking a bunch of bfloat16 weights and saving them as float32 (the Keras default), we are just wasting a ton of disk space for no gain. We can still load in a different dtype than we save.
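A sketch of that split, with hypothetical preset paths: keep the saved checkpoint in the source dtype, and let the numerics check (or a user) load it at a different dtype.

```python
import keras_hub

# Hypothetical preset paths; only the dtype handling matters here.
# Keep the converted checkpoint in its source dtype on disk...
backbone_bf16 = keras_hub.models.Backbone.from_preset(
    "./qwen2.5_0.5b_converted", dtype="bfloat16"
)
backbone_bf16.save_to_preset("./qwen2.5_0.5b_en_bf16")

# ...while the numerics check (or a user) can still load at float32.
backbone_fp32 = keras_hub.models.Backbone.from_preset(
    "./qwen2.5_0.5b_en_bf16", dtype="float32"
)
```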
While doing weight matching, if I load the models in float32, the outputs match with an atol of 1e-3; however, the delta is quite wide when I load in bfloat16. The difference in the intermediate outputs starts from the first layernorm (where casting to fp32, applying the norm, and casting back to bf16 happens).
@mattdangerw / @abheesht17 did you get a chance to take a look at it?
I think it's necessary to check in detail where the error is. As much as possible, we should ensure that the fp32 error is around 1e-5 under the torch backend. The maximum error of bf16 should not exceed 1e-2.
@mattdangerw / @abheesht17 / @divyashreepathihalli How do you completely disable the MPS backend with Keras? Please take a look at the latest conversion script: even though I am moving the model to the CPU using keras.device and also moving the inputs, the reversible embedding call step exits with a stacktrace.
The stacktrace indicates that an allocation is still happening on MPS, which I have already disabled!
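A couple of things that may help keep allocations off MPS under the torch backend (a sketch only; whether this fully avoids the MPS allocation seen here is exactly the open question):

```python
import os

os.environ["KERAS_BACKEND"] = "torch"  # must be set before importing keras

import torch
import keras

# Ask PyTorch to place newly created tensors on the CPU by default, in
# addition to the Keras-level device scope, since some allocations may
# happen outside that scope.
torch.set_default_device("cpu")

with keras.device("cpu"):
    # Build the converted model and run the forward pass here.
    ...
```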
Closes #2078
References:
Qwen 2.5 uses the Qwen2 backbone from Hugging Face Transformers.
HF Config path
HF Source Code