[webgpu] Restore MatMulNBits workgroup size for Phi-3.5 #23349
base: main
Conversation
This change restores the MatMulNBits workgroup size from (8, 8, 1) back to (16, 8, 1) to resolve a performance regression observed on Intel iGPUs during token generation (M=1).

Signed-off-by: Jianhui Dai <[email protected]>
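For context on where this tuple lives: in WebGPU, workgroup dimensions are declared with the `@workgroup_size` attribute on the shader's compute entry point. The WGSL sketch below is purely illustrative and not the actual MatMulNBits shader generated by the onnxruntime WebGPU execution provider; it only shows the attribute that the (16, 8, 1) value restored by this PR feeds into.

```wgsl
// Illustrative sketch only; the real MatMulNBits shader is generated by the
// onnxruntime WebGPU execution provider and is far more involved.
@compute @workgroup_size(16, 8, 1) // restored by this PR; (8, 8, 1) regressed M=1 decoding on Intel iGPUs
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  // Per-invocation dequantize-and-accumulate work for the n-bit quantized
  // matmul would go here.
}
```

The PR does not state a root cause; one plausible factor is that a wider workgroup changes how work is distributed across hardware threads, to which Intel iGPUs appear to be sensitive here.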
Please take a look before broader review.
Do you have more specific information on how much this regresses, and on which GPU platforms?
Phi-3.5 decoding performance on LNL (Lunar Lake) decreased from approximately 28 tps to 26 tps. Detailed profiling results are below.
[profiling screenshot: workgroup_size 8x8x1]
[profiling screenshot: workgroup_size 16x8x1]
/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline
Azure Pipelines successfully started running 2 pipeline(s).
/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models
/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI
Azure Pipelines successfully started running 4 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).