
Add Qwen 2.5 #2088

Open · wants to merge 17 commits into master

Conversation

@shivance (Collaborator) commented Feb 9, 2025

Closes #2078

References:

- Qwen 2.5 uses the Qwen2 backbone from Hugging Face Transformers
- HF Config path
- HF Source Code

@shivance shivance self-assigned this Feb 9, 2025
@shivance shivance changed the title Add Qwen 2.5 [WIP] Add Qwen 2.5 Feb 9, 2025
@abheesht17 (Collaborator) commented Feb 10, 2025

Thanks for the PR! Before review, could you please do a forward pass and match the output with HF's Qwen? Also, let's make it a draft PR until then.

@abheesht17 (Collaborator) left a review comment

Took a cursory glance. Let's do the weight conversion and numerics check first!

@shivance shivance marked this pull request as draft February 10, 2025 10:00
@divyashreepathihalli (Collaborator) commented Feb 12, 2025

To fix the code format error, you will need to run `shell/api_gen.sh` at the repo root. If you don't have ruff, install it with `pip install ruff`, and then run `shell/format.sh` at the root.

@abheesht17 (Collaborator) commented
@shivance - let us know when this PR is ready for review. Thanks!

@shivance (Collaborator, Author) commented Feb 18, 2025

@abheesht17 I have the tokenizer working now; I am working on matching the output of the HF model and the Keras model.
Thanks for your patience!

[Screenshot: tokenizer output, 2025-02-18 10:23 PM]

@abheesht17 (Collaborator) commented
Great, no hurry. Was just checking. Do ping if you hit any blockers :)

@shivance (Collaborator, Author) commented Feb 18, 2025

@abheesht17 I see that in the newer checkpoint conversion scripts we use the `set_weights` method, e.g.

keras_hub_model.transformer_layers[i]._self_attention_layer._query_dense.set_weights(
    [
        hf_model.model.layers[i]
        .self_attn.q_proj.weight.T.reshape(
            config.hidden_size,
            config.num_attention_heads,
            config.hidden_size // config.num_attention_heads,
        )
        .detach()
        .cpu()
        .float()
        .numpy()
    ]
)

instead of the old kernel `assign`:

keras_hub_model.get_layer(f"f_net_layer_{i}")._intermediate_dense.kernel.assign(
    hf_wts[f"encoder.layer.{i}.intermediate.dense.weight"]
    .transpose(1, 0)
    .numpy()
)

Has the API changed for assigning biases as well? Why was the new method created, and what is the difference?
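
(For context, a minimal sketch contrasting the two APIs on a plain `keras.layers.Dense` layer; the layer and shapes here are illustrative, not from this PR. `set_weights` replaces all of a layer's variables at once from numpy arrays, while `Variable.assign` updates a single variable in place.)

import numpy as np
from keras import layers

dense = layers.Dense(4)
dense.build((None, 8))  # creates kernel with shape (8, 4) and bias with shape (4,)

# `set_weights` replaces every variable of the layer at once,
# in creation order (kernel first, then bias), from numpy arrays.
dense.set_weights([np.zeros((8, 4)), np.zeros((4,))])

# `Variable.assign` updates a single variable in place, so kernel
# and bias can be converted independently.
dense.kernel.assign(np.ones((8, 4)))
dense.bias.assign(np.ones((4,)))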

@shivance (Collaborator, Author) commented
[Screenshot: logits comparison, 2025-02-19 12:37 AM]

@abheesht17 Upon weight loading, the outputs look like the screenshot above. There is still some delta here, but

np.testing.assert_allclose(
    keras_hub_logits, hf_output_logits, atol=1e-3
)

succeeds, i.e., an absolute tolerance of 1e-3.

I am testing at fp32, since it's a 0.5B model.

@shivance shivance changed the title [WIP] Add Qwen 2.5 Add Qwen 2.5 Feb 18, 2025
@shivance shivance marked this pull request as ready for review February 18, 2025 19:20
@shivance (Collaborator, Author) commented
@abheesht17 I have marked this PR as ready for review.

@abheesht17 (Collaborator) commented Feb 20, 2025

> @abheesht17 I have marked this PR as ready for review.

Great. Were you able to bring the difference in numerics down to 1e-5? It might be worth checking layer by layer which one is causing the issue.
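
(A minimal sketch of such a layer-by-layer check, assuming the torch backend and the `token_embedding` / `transformer_layers` attributes seen in the conversion snippet above; the `decoder_padding_mask` argument name is a guess and may need adapting.)

import numpy as np
import torch
from keras import ops

# HF side: collect the hidden state after the embedding and each decoder layer.
with torch.no_grad():
    hf_out = hf_model(**hf_inputs, output_hidden_states=True)
hf_hidden = [h.float().cpu().numpy() for h in hf_out.hidden_states]

# Keras side: replay the backbone layer by layer, mirroring its call order.
x = keras_hub_model.token_embedding(token_ids)
keras_hidden = [ops.convert_to_numpy(x)]
for layer in keras_hub_model.transformer_layers:
    x = layer(x, decoder_padding_mask=padding_mask)
    keras_hidden.append(ops.convert_to_numpy(x))

# Report the max absolute error per layer to localize where the delta grows.
for i, (kh, hh) in enumerate(zip(keras_hidden, hf_hidden)):
    print(f"layer {i}: max abs err = {np.max(np.abs(kh - hh)):.2e}")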

@abheesht17 (Collaborator) commented Feb 21, 2025

@shivance - can you please share the weight conversion Colab as well?

Edit: never mind, the conversion script is part of the PR.

@shivance (Collaborator, Author) commented
@abheesht17 Here is the Colab version of the conversion script.

@shivance (Collaborator, Author) commented
@abheesht17 Did you get a chance to inspect the delta in the output?

@mattdangerw (Member) left a review comment
Thanks! Just some initial comments and questions.

@@ -77,7 +77,9 @@ def build(self, input_shape):
         # Defer packer creation to `build()` so that we can be sure tokenizer
         # assets have loaded when restoring a saved model.
         self.packer = StartEndPacker(
-            start_value=self.tokenizer.start_token_id,
+            start_value=self.tokenizer.start_token_id
Member:
Why do we need this? We pass `add_start_value=self.add_start_token` below when we call the layer. It seems simpler to configure the layer so the packer always knows the start value, and if a user were calling the packer directly, they could just pass `add_start_token=True` during the call.

Collaborator (Author):
So, if you check the Qwen tokenizer config, it doesn't have a BOS token. So the start-end packer throws an exception when it tries to access `start_token_id`, since it's not even there.

Stack trace:
> Keras 3 model and tokenizer loaded.
Traceback (most recent call last):
  File "/Users/flip/Desktop/Projects/keras-hub/tools/checkpoint_conversion/convert_qwen_checkpoints.py", line 307, in <module>
    app.run(main)
  File "/Users/flip/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/Users/flip/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/Users/flip/Desktop/Projects/keras-hub/tools/checkpoint_conversion/convert_qwen_checkpoints.py", line 293, in main
    test_tokenizer(keras_hub_tokenizer, hf_tokenizer)
  File "/Users/flip/Desktop/Projects/keras-hub/tools/checkpoint_conversion/convert_qwen_checkpoints.py", line 234, in test_tokenizer
    keras_hub_output = keras_hub_preprocessor(
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/flip/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/flip/Desktop/Projects/keras-hub/keras_hub/src/models/causal_lm_preprocessor.py", line 80, in build
    start_value=self.tokenizer.start_token_id,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/flip/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
    raise AttributeError(
AttributeError: 'Qwen2Tokenizer' object has no attribute 'start_token_id'. Did you mean: 'pad_token_id'?
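
(One illustrative way to guard against this, not necessarily what the PR ends up doing: fall back to `None` when the tokenizer defines no start token, since `StartEndPacker` skips the start token when `start_value` is `None`.)

# Hypothetical guard inside `build()` for tokenizers without a BOS token.
start_value = getattr(self.tokenizer, "start_token_id", None)
self.packer = StartEndPacker(
    start_value=start_value,
    end_value=self.tokenizer.end_token_id,
    sequence_length=self.sequence_length,
)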

@keras_hub_export("keras_hub.models.Qwen2Backbone")
class Qwen2Backbone(Backbone):
    """
    #TODO:
Member:
Let's add this in before merge! Even just a one-liner: "The Qwen2 decoder network."

Collaborator (Author):
Done!



@keras_hub_export("keras_hub.models.Qwen2Backbone")
class Qwen2Backbone(Backbone):
Member:
How different is Qwen 1 from Qwen 2, btw?

Contributor:
> How different is Qwen 1 from Qwen 2, btw?

There is no difference across the Qwen series.

misc_special_tokens -= {eos_token}

# Add misc special tokens
for i, token in enumerate(misc_special_tokens):
Member:
What are these used for? I don't see them used anywhere. A lot of tokenizers have reserved and unused tokens (e.g., for BERT, the first thousand or so, I think); we don't generally give them special treatment.

Collaborator (Author):
I just followed the Llama 3 tokenizer!

    self._add_special_token(token, f"special_token_{i:03d}")
    special_tokens.add(token)

# Add alternate EOS token if needed
Member:
When is this needed? And why?

Collaborator (Author):
my bad.

rope_max_wavelength=hf_model.config.rope_theta,
use_sliding_window=hf_model.config.use_sliding_window,
sliding_window_size=hf_model.config.sliding_window,
# dtype="bfloat16"
Member:
Are we saving at full precision? We should probably save with the same dtype we are converting from. If we are taking a bunch of bfloat16 weights and saving them as float32 (the Keras default), we are just wasting a ton of disk space for no gain. We can still load in a different dtype than we save.
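
(A hedged sketch of that save/load split, assuming a hypothetical local preset directory; `set_dtype_policy`, `save_to_preset`, and the `dtype` argument to `from_preset` are existing Keras / KerasHub APIs.)

import keras
import keras_hub

# Convert and save at the checkpoint's native precision (bfloat16),
# so the saved preset doesn't balloon to float32 on disk.
keras.config.set_dtype_policy("bfloat16")
# ... build `keras_hub_model` and copy weights here, then:
# keras_hub_model.save_to_preset("./qwen2.5-0.5b")  # hypothetical path

# Later, load at whatever precision a numerics check needs.
model = keras_hub.models.Backbone.from_preset(
    "./qwen2.5-0.5b",  # hypothetical local path
    dtype="float32",
)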

Collaborator (Author):
While doing weight matching, if I load the models in float32, the weights match with an atol of 1e-3; however, the delta is quite wide when I load in bfloat16. The difference in the intermediate outputs starts from the first layernorm (where casting to fp32, applying the norm, and casting back to bf16 happens).
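
(For reference, the upcast-normalize-downcast pattern described above, sketched after HF's `Qwen2RMSNorm`; the function name here is illustrative.)

import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    input_dtype = x.dtype
    x = x.to(torch.float32)  # upcast: the norm is computed in fp32
    variance = x.pow(2).mean(-1, keepdim=True)
    x = x * torch.rsqrt(variance + eps)
    return weight * x.to(input_dtype)  # downcast back to bf16 afterwards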

Collaborator (Author):
@mattdangerw / @abheesht17 did you get a chance to take a look at it?

@pass-lin (Contributor) commented Mar 1, 2025

I think it's necessary to check in detail where the error is. As much as possible, we should ensure that the fp32 error is around 1e-5 under the torch backend, and that the maximum bf16 error does not exceed 1e-2.
I've implemented a Keras model with a similar error before, and this level of error caused a significant decrease in inference performance, as well as repetition.

@shivance (Collaborator, Author) commented Mar 8, 2025

@mattdangerw / @abheesht17 / @divyashreepathihalli How do you completely disable the MPS backend with Keras?

Please take a look at the latest conversion script: even though I am moving the model to CPU using `keras.device`, and also moving the inputs, the reversible embedding call step exits with this stack trace:

-> Keras 3 model and tokenizer loaded.
Traceback (most recent call last):
  File "/Users/anshuman/.pyenv/versions/3.11.10/lib/python3.11/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/.pyenv/versions/3.11.10/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/Users/anshuman/.cursor/extensions/ms-python.debugpy-2024.6.0-darwin-arm64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/Users/anshuman/.cursor/extensions/ms-python.debugpy-2024.6.0-darwin-arm64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/Users/anshuman/.cursor/extensions/ms-python.debugpy-2024.6.0-darwin-arm64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/Users/anshuman/.cursor/extensions/ms-python.debugpy-2024.6.0-darwin-arm64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/.cursor/extensions/ms-python.debugpy-2024.6.0-darwin-arm64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/Users/anshuman/.cursor/extensions/ms-python.debugpy-2024.6.0-darwin-arm64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/Users/anshuman/Desktop/Projects/keras-hub/tools/checkpoint_conversion/convert_qwen_checkpoints.py", line 353, in <module>
    app.run(main)
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/Users/anshuman/Desktop/Projects/keras-hub/tools/checkpoint_conversion/convert_qwen_checkpoints.py", line 336, in main
    test_model(keras_hub_model, keras_hub_tokenizer, hf_model, hf_tokenizer)
  File "/Users/anshuman/Desktop/Projects/keras-hub/tools/checkpoint_conversion/convert_qwen_checkpoints.py", line 228, in test_model
    keras_hub_output = keras_hub_model(keras_hub_inputs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/Desktop/Projects/keras-hub/keras_hub/src/layers/modeling/reversible_embedding.py", line 129, in call
    return super().call(inputs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/torch/nn/functional.py", line 2516, in embedding
    return handle_torch_function(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/torch/overrides.py", line 1720, in handle_torch_function
    result = mode.__torch_function__(public_api, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anshuman/.pyenv/versions/3.11.10/envs/qwen/lib/python3.11/site-packages/torch/nn/functional.py", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Exception encountered when calling ReversibleEmbedding.call().

Placeholder storage has not been allocated on MPS device!

Arguments received by ReversibleEmbedding.call():
  • inputs=torch.Tensor(shape=torch.Size([1, 5]), dtype=int32)
  • reverse=False


The stack trace indicates that an allocation is still happening on MPS somewhere, even though I have already disabled it!
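
(One possible workaround to try, untested here: pin torch's default device to CPU before Keras allocates anything, so no tensor lands on MPS in the first place. `torch.set_default_device` exists in torch >= 2.0.)

import os
os.environ["KERAS_BACKEND"] = "torch"

import torch
torch.set_default_device("cpu")  # keep all newly created tensors off MPS

import keras  # import after the default device is set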
