SYCL: Add fp16 type support to unary op kernels #12788

Open
wants to merge 5 commits into master

Conversation

qnixsynapse
Collaborator

@qnixsynapse qnixsynapse commented Apr 7, 2025

There are probably better ways to do this.

We need to disable fp16 support on devices that do not support fp16 in hardware. Either we check whether the build was compiled with the GGML_SYCL_F16 compile flag and disable it in the device_supports_op function, or we add information about the current hardware's features and query it through a helper function.

This still needs proper testing.
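
A minimal sketch of the first option, assuming a device_supports_op-style hook and the existing GGML_SYCL_F16 build flag (the helper name `unary_op_supports_type` is made up for illustration, not the PR's actual code):

```cpp
#include "ggml.h"

// Hypothetical helper: only report fp16 support for unary ops when the
// backend was built with GGML_SYCL_F16; otherwise fall back to fp32 only.
static bool unary_op_supports_type(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_F32:
            return true;
        case GGML_TYPE_F16:
#ifdef GGML_SYCL_F16
            return true;   // fp16 kernels compiled in
#else
            return false;  // no fp16 path in this build
#endif
        default:
            return false;
    }
}
```

The second option would replace the compile-time check with a runtime query of the device's fp16 capability, e.g. something like `dev.has(sycl::aspect::fp16)`.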

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels on Apr 7, 2025
@NeoZhangJianyu
Collaborator

I see that no UT cases for FP16 are enabled.
Could you update the UT cases to enable FP16 for these ops?
Then test them and make sure they pass.

@qnixsynapse
Collaborator Author

> I see that no UT cases for FP16 are enabled. Could you update the UT cases to enable FP16 for these ops? Then test them and make sure they pass.

I think the UT cases are already present in test-backend-ops; I had disabled them at the time (in #12201).

@qnixsynapse qnixsynapse marked this pull request as draft April 7, 2025 07:35
@NeoZhangJianyu
Collaborator

OK, I suggest enabling them and using them to test this PR.

@qnixsynapse
Collaborator Author

> OK, I suggest enabling them and using them to test this PR.

Already enabled and tested.

@qnixsynapse
Collaborator Author

qnixsynapse commented Apr 8, 2025

It seems that in actual inference of an fp16 model (Gemma 2 2B F16 in this case), the intermediate hidden embeddings are converted to fp32:

call ggml_sycl_add done
call ggml_sycl_rms_norm
call ggml_sycl_rms_norm done
call ggml_sycl_mul
call ggml_sycl_mul done
[SYCL] ggml_sycl_cpy: Tensor supplied: f32 to f16
[SYCL] ggml_sycl_cpy: Tensor supplied: f32 to f16
call ggml_sycl_tanh: DST Tensor type: f32 <-----------------------
call ggml_sycl_tanh done
ggml_sycl_op_soft_max: F32 mask
[SYCL] call ggml_sycl_dup
[SYCL] ggml_sycl_cpy: Tensor supplied: f32 to f32
[SYCL] call ggml_sycl_dup done
call ggml_sycl_rms_norm
call ggml_sycl_rms_norm done
call ggml_sycl_mul
call ggml_sycl_mul done
call ggml_sycl_add
call ggml_sycl_add done
call ggml_sycl_rms_norm
call ggml_sycl_rms_norm done
call ggml_sycl_mul
call ggml_sycl_mul done
call ggml_sycl_gelu: DST Tensor type: f32 <------------------------
call ggml_sycl_gelu done
call ggml_sycl_mul
call ggml_sycl_mul done
call ggml_sycl_rms_norm
call ggml_sycl_rms_norm done
call ggml_sycl_mul
call ggml_sycl_mul done
call ggml_sycl_add
call ggml_sycl_add done

So there is no way to test the numerical stability of the fp16 operations except through test-backend-ops:


  GELU(type=f16,ne_a=[128,2,2,2],v=0): call ggml_sycl_gelu: DST Tensor type: f16
call ggml_sycl_gelu done
OK
  GELU(type=f16,ne_a=[5,7,11,13],v=0): call ggml_sycl_gelu: DST Tensor type: f16
call ggml_sycl_gelu done
OK
  GELU(type=f32,ne_a=[128,2,2,2],v=0): call ggml_sycl_gelu: DST Tensor type: f32
call ggml_sycl_gelu done
OK
  GELU(type=f32,ne_a=[5,7,11,13],v=0): call ggml_sycl_gelu: DST Tensor type: f32
call ggml_sycl_gelu done
OK
  TANH(type=f16,ne_a=[128,2,2,2],v=0): call ggml_sycl_tanh: DST Tensor type: f16
call ggml_sycl_tanh done
OK
  TANH(type=f16,ne_a=[5,7,11,13],v=0): call ggml_sycl_tanh: DST Tensor type: f16
call ggml_sycl_tanh done
OK
  TANH(type=f32,ne_a=[128,2,2,2],v=0): call ggml_sycl_tanh: DST Tensor type: f32
call ggml_sycl_tanh done
OK
  TANH(type=f32,ne_a=[5,7,11,13],v=0): call ggml_sycl_tanh: DST Tensor type: f32
call ggml_sycl_tanh done
OK
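
(For anyone who wants to rerun just these cases: test-backend-ops takes an op filter and a backend filter, so an invocation along the lines of `./build/bin/test-backend-ops test -o GELU -b SYCL0` should exercise only the GELU cases on the SYCL backend; the exact backend name depends on the machine.)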

I am marking this PR "ready for review" for now to get some comments from others.

@qnixsynapse qnixsynapse marked this pull request as ready for review April 8, 2025 05:34
                         const sycl::nd_item<3> &item_ct1) {
    const int i = item_ct1.get_local_range(2) * item_ct1.get_group(2) +
                  item_ct1.get_local_id(2);

    if (i >= k) {
        return;
    }
-   dst[i] = x[i] / (1.0f + sycl::native::exp(-x[i]));
+   dst[i] = x[i] / (to_T<T>(1.0f) + sycl::native::exp(-x[i]));
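
In the hunk above, `to_T<T>` is the PR's conversion helper for promoting the fp32 literals to the kernel's element type. A minimal sketch of what such a helper could look like (a guess at the intent, not the PR's actual definition):

```cpp
#include <sycl/sycl.hpp>

// Hypothetical to_T-style helper: convert an fp32 constant to the kernel's
// element type T (float or sycl::half) so the same templated kernel body
// works for both precisions instead of promoting everything to float.
template <typename T>
static inline T to_T(float v) {
    return static_cast<T>(v);
}
```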
Contributor

sycl::exp over sycl::native::exp ?

Collaborator Author

May I ask why? Please note that I am not the author of the original code.
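
For background on the suggestion, a minimal sketch of the difference as I understand the SYCL 2020 built-ins (worth verifying against the spec): sycl::native::exp is defined only for float types with implementation-defined precision, while sycl::exp has accuracy requirements and also provides half overloads.

```cpp
#include <sycl/sycl.hpp>

// Sketch only, assuming a SYCL 2020 compiler with fp16 support:
// sycl::exp accepts sycl::half directly, while sycl::native::exp is a
// float-only, reduced-precision builtin, so half inputs have to
// round-trip through float.
int main() {
    sycl::half h = 0.5f;
    sycl::half a = sycl::exp(h);                              // half overload exists
    float      b = sycl::native::exp(static_cast<float>(h));  // float only
    return (static_cast<float>(a) > 0.0f && b > 0.0f) ? 0 : 1;
}
```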

                            const sycl::nd_item<3> &item_ct1) {
    const int i = item_ct1.get_local_range(2) * item_ct1.get_group(2) +
                  item_ct1.get_local_id(2);

    if (i >= k) {
        return;
    }
-   dst[i] = 1.0f / (1.0f + sycl::native::exp(-x[i]));
+   dst[i] = 1.0f / (to_T<T>(1.0f) + sycl::native::exp(-x[i]));
Contributor

sycl::exp over sycl::native::exp ?

                                const sycl::nd_item<3> &item_ct1) {
    const int i = item_ct1.get_local_range(2) * item_ct1.get_group(2) +
                  item_ct1.get_local_id(2);

    if (i >= k) {
        return;
    }
-   dst[i] = sycl::fmin(1.0f, sycl::fmax(0.0f, (x[i] + 3.0f) / 6.0f));
+   dst[i] = sycl::fmin(to_T<T>(1.0f), sycl::fmax(to_T<T>(0.0f), (x[i] + to_T<T>(3.0f)) / to_T<T>(6.0f)));
Contributor

Isn't hard sigmoid max(0, min(1, (x + 1) / 2))?
see: https://arxiv.org/pdf/1511.00363v3

Collaborator Author

Although I'm not the original author, I think this is correct too.

Here we are computing
hard_sigmoid(x) = clip((x + 3) / 6, 0, 1).
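
For reference, both formulas go by "hard sigmoid" in the literature; a small sketch of the two variants (the function names are mine, just for comparison):

```cpp
#include <algorithm>

// Variant used by the kernel in this PR and by e.g. PyTorch's hardsigmoid:
// non-constant on [-3, 3], slope 1/6.
inline float hard_sigmoid_x3_over_6(float x) {
    return std::min(1.0f, std::max(0.0f, (x + 3.0f) / 6.0f));
}

// Variant from the linked paper, clip((x + 1) / 2, 0, 1):
// non-constant on [-1, 1], slope 1/2.
inline float hard_sigmoid_x1_over_2(float x) {
    return std::min(1.0f, std::max(0.0f, (x + 1.0f) / 2.0f));
}
```

Both saturate at 0 and 1; they differ only in the slope and in the input range over which they are non-constant.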
