support multiheadattention int8 #3940

tpoisonooo · 2022-06-21T07:23:17Z

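What this does

Adds an int8 kernel for MHA (multi-head attention).

GEMM weights are still quantized per-channel.
Internally, 5 input scale parameters are needed:
- the scales of xq/xk/xv
- the scale before softmax
- the scale before multiplying by out_weight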
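
As a rough illustration of how a per-channel weight scale and a single input scale combine in an int8 GEMM, here is a minimal Python sketch. It is not the kernel from this PR: the helper names (`absmax_scale`, `quantize`, `int8_gemv`) are hypothetical, and both the `scale = 127 / absmax` convention and the absmax calibration of the input are assumptions made for the example.

```python
def absmax_scale(values):
    """Assumed convention: scale maps the largest magnitude to 127."""
    m = max(abs(v) for v in values)
    return 127.0 / m if m > 0 else 1.0


def quantize(values, scale):
    """Symmetric int8 quantization: q = clamp(round(v * scale), -127, 127)."""
    return [max(-127, min(127, round(v * scale))) for v in values]


def int8_gemv(weight_rows, x, input_scale):
    """Matrix-vector product with one scale per output channel (weight row)
    and one scale for the whole input vector."""
    x_q = quantize(x, input_scale)
    out = []
    for row in weight_rows:
        w_scale = absmax_scale(row)          # per-channel weight scale
        w_q = quantize(row, w_scale)
        acc = sum(wq * xq for wq, xq in zip(w_q, x_q))  # integer accumulate
        out.append(acc / (w_scale * input_scale))       # dequantize
    return out


weight = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.03]]
x = [1.0, -2.0, 0.5]

approx = int8_gemv(weight, x, absmax_scale(x))
exact = [sum(w * v for w, v in zip(row, x)) for row in weight]
print(approx, exact)
```

The second weight row is about 30 times smaller than the first. With one scale per row it still uses the full int8 range, which is the usual argument for per-channel weight quantization.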

Speed comparison (WSL2 virtual machine)

1 thread

$ ./benchncnn  10 1
loop_count = 10
num_threads = 1
powersave = 0
gpu_device = -1
cooling_down = 1
  vision_transformer  min = 2955.98  max = 3130.18  avg = 3051.40
vision_transformer_int8  min = 2403.91  max = 2459.07  avg = 2431.06

8 threads

$ ./benchncnn
loop_count = 4
num_threads = 8
powersave = 0
gpu_device = -1
cooling_down = 1
  vision_transformer  min = 1175.01  max = 1575.90  avg = 1343.40
vision_transformer_int8  min = 1076.93  max = 1153.30  avg = 1109.33
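
The relative gain can be read off the `avg` columns of the two runs. The short Python snippet below is only arithmetic on those reported numbers; the dictionary layout and function names are made up for illustration.

```python
# Average latencies (ms) copied from the benchncnn output in this comment.
avg_ms = {
    1: {"fp32": 3051.40, "int8": 2431.06},
    8: {"fp32": 1343.40, "int8": 1109.33},
}


def speedup(fp32_ms, int8_ms):
    """How many times faster int8 is than fp32."""
    return fp32_ms / int8_ms


def time_saved_pct(fp32_ms, int8_ms):
    """Percentage of fp32 latency removed by switching to int8."""
    return (fp32_ms - int8_ms) / fp32_ms * 100.0


for threads, t in avg_ms.items():
    print(
        f"{threads} thread(s): "
        f"speedup x{speedup(t['fp32'], t['int8']):.3f}, "
        f"time saved {time_saved_pct(t['fp32'], t['int8']):.2f}%"
    )
```

By this measure int8 cuts the average time by roughly 20% with 1 thread and roughly 17% with 8 threads.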

Softmax numerical result comparison

Version with the three operator types mha/conv/gemm quantized directly, without bias calibration
(base) khj@khj:~/ncnn/ninjabuild/examples$ ./vision_transformer
data size 1769472
output shape whc 1000,1,1
softmax result: 65 0.978581

Floating-point version
(base) khj@khj:~/ncnn/ninjabuild/examples$ ./vision_transformer_fp32
data size 1769472
output shape whc 1000,1,1
softmax result: 65 0.985758

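Notes

PR 3911 needs to be handled first; I will rebase after that.
Alternatively, review this one directly, which amounts to the same thing.

Accuracy test

PyTorch fp32 original model, full set of 50,000 images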
top-1 84.01%
top-5 97.08%

Baseline: ncnn fp32 original model; CPU inference is too slow, so only 2000 images were run
2022-06-28 17:49:46,793 - test - INFO - accuracy_top-1 : 83.55
2022-06-28 17:49:46,799 - test - INFO - accuracy_top-5 : 97.55

Quantized conv+mha
2022-06-28 14:26:39,188 - test - INFO - accuracy_top-1 : 83.25
2022-06-28 14:26:39,194 - test - INFO - accuracy_top-5 : 97.65

Quantized conv+mha+gemm
2022-06-27 21:05:06,841 - test - INFO - accuracy_top-1 : 82.55
2022-06-27 21:05:06,844 - test - INFO - accuracy_top-5 : 97.45

Quantized conv+mha+gemm, with bias calibration
2022-06-29 12:31:18,982 - test - INFO - accuracy_top-1 : 82.80
2022-06-29 12:31:18,984 - test - INFO - accuracy_top-5 : 97.55
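
This comment does not describe how the bias calibration is done. One common form of bias correction, shown here purely as an assumed illustration, shifts each output channel's bias by the mean difference between the fp32 and quantized outputs measured on calibration data. The function name and the toy numbers below are hypothetical.

```python
def corrected_bias(bias, fp32_outputs, int8_outputs):
    """Add the per-channel mean error (fp32 - int8) to the bias.

    fp32_outputs / int8_outputs: one list of per-channel values per
    calibration sample, computed by the fp32 and the quantized layer.
    """
    n = len(fp32_outputs)
    new_bias = []
    for c, b in enumerate(bias):
        mean_err = sum(f[c] - q[c] for f, q in zip(fp32_outputs, int8_outputs)) / n
        new_bias.append(b + mean_err)
    return new_bias


# Toy calibration data: 2 samples, 2 output channels.
bias = [0.10, -0.20]
fp32_out = [[1.00, 2.00], [3.00, 4.00]]
int8_out = [[0.90, 2.10], [2.90, 4.30]]

new_bias = corrected_bias(bias, fp32_out, int8_out)
print(new_bias)
```

Channel 0 is consistently 0.1 too low after quantization, so its bias moves up by 0.1; channel 1 is on average 0.2 too high, so its bias moves down by 0.2.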

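Conclusion: quantizing mha+conv directly changes top-1 by -0.3%; quantizing gemm directly changes it by a further -0.7%, and bias calibration recovers +0.25%.

With the naive implementation: overall speedup 20%, accuracy drop -0.75%, model size 337 MB -> 86 MB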
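
The figures in the conclusion follow from the top-1 values logged in this comment. The snippet below just redoes that arithmetic; the dictionary keys are labels chosen for this example.

```python
# Top-1 accuracy (%) on the 2000-image subset, copied from the logs in this comment.
top1 = {
    "fp32": 83.55,
    "conv+mha": 83.25,
    "conv+mha+gemm": 82.55,
    "conv+mha+gemm+bias": 82.80,
}

delta_mha_conv = top1["conv+mha"] - top1["fp32"]
delta_gemm = top1["conv+mha+gemm"] - top1["conv+mha"]
delta_bias = top1["conv+mha+gemm+bias"] - top1["conv+mha+gemm"]
delta_total = top1["conv+mha+gemm+bias"] - top1["fp32"]

# Model size before and after quantization (MB), from the summary line.
size_ratio = 337 / 86

print(f"mha+conv: {delta_mha_conv:+.2f}")
print(f"gemm:     {delta_gemm:+.2f}")
print(f"bias:     {delta_bias:+.2f}")
print(f"total:    {delta_total:+.2f}")
print(f"size:     {size_ratio:.2f}x smaller")
```

Note that these deltas come from a 2000-image subset, so a single image is worth 0.05 percentage points.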

codecov-commenter · 2022-06-23T13:55:52Z

Codecov Report

Merging #3940 (3f1844b) into master (8c06103) will decrease coverage by 0.18%.
The diff coverage is 9.72%.

@@            Coverage Diff             @@
##           master    #3940      +/-   ##
==========================================
- Coverage   93.84%   93.65%   -0.19%     
==========================================
  Files         721      728       +7     
  Lines      175071   177009    +1938     
==========================================
+ Hits       164291   165778    +1487     
- Misses      10780    11231     +451

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/layer/multiheadattention.cpp | `47.82% <9.72%> (-45.41%)` | ⬇️ |
| src/command.cpp | `72.70% <0.00%> (-14.94%)` | ⬇️ |
| src/pipeline.cpp | `58.69% <0.00%> (-2.18%)` | ⬇️ |
| src/layer/vulkan/reshape_vulkan.cpp | `92.01% <0.00%> (-2.14%)` | ⬇️ |
| src/layer/x86/cast_x86.cpp | `96.07% <0.00%> (-1.91%)` | ⬇️ |
| src/layer/vulkan/packing_vulkan.cpp | `81.70% <0.00%> (-1.88%)` | ⬇️ |
| src/layer/vulkan/permute_vulkan.cpp | `96.99% <0.00%> (-1.60%)` | ⬇️ |
| src/layer/vulkan/reorg_vulkan.cpp | `96.35% <0.00%> (-1.57%)` | ⬇️ |
| src/layer/vulkan/pixelshuffle_vulkan.cpp | `96.35% <0.00%> (-1.57%)` | ⬇️ |
| src/layer/vulkan/flatten_vulkan.cpp | `95.97% <0.00%> (-1.51%)` | ⬇️ |

... and 49 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8c06103...3f1844b.

tpoisonooo and others added 30 commits June 13, 2022 17:14

feat(tools/quantize): support toml

2a5a296

apply code-format changes

e8ad914

feat(tools/quantize): add .ini parser

77e6546

apply code-format changes

8e2f806

improvement(tools/quantize): add ini config

146b8ba

Merge branch 'master' of https://github.com/tencent/ncnn into ncnn-int8-toml

12f075f

Merge branch 'ncnn-int8-toml' of https://github.com/tpoisonooo/ncnn into ncnn-int8-toml

f719ee7

improvement(tools/quantize): refactor code

9863b26

apply code-format changes

1612caf

test(tools/quantize/ncnn2int8): test quant sqznet

be66fac

improvement(CMakeLists): downgrade to cxx11

ba6640d

apply code-format changes

d106fc0

Update CMakeLists.txt

fab112d

Update ncnn2table.cpp

77cf07a

Merge branch 'ncnn-int8-toml' of https://github.com/tpoisonooo/ncnn into ncnn-int8-toml

9262515

fix(CI): remove cxx17 grammar

9d473f5

fix(tools/quantize): typo

181714e

docs(ncnn2int8): add ini description

b32dd56

feat(ncnn2int8): parse mha

12bef90

feat(src/layer): add mha int8

c7641ca

apply code-format changes

f20318b

feat(src/layer): add mha int8

4de1aff

Merge branch 'master' of https://github.com/tencent/ncnn into support-mha-int8

acedd44

Merge branch 'support-mha-int8' of https://github.com/tpoisonooo/ncnn into support-mha-int8

9d743fe

feat(src/layer): mha int8 input transform

2428661

apply code-format changes

5305e50

feat(src/layer/multiheadattention): add log_int_softmax

8d276f4

Merge branch 'support-mha-int8' of https://github.com/tpoisonooo/ncnn into support-mha-int8

a560617

apply code-format changes

75061d9

feat(src/layer): log_int_softmax

30d6388

tpoisonooo and others added 4 commits June 21, 2022 22:02

fix(multiheadattention.cpp): load bias

de0e76a

fix(src/layer): model load size error

449f9cb

fix(net_quantize.cpp): weight scale

3c96faa

apply code-format changes

c81850e

tpoisonooo and others added 8 commits June 24, 2022 21:42

fix(lis): scale error

83e3368

fix(mha): single opr precision

58df666

improvement(mha): fp32 version using fake quant

b958cab

fix(mha): remove LIS and get good precision

0843acf

Merge branch 'support-mha-int8' of https://github.com/tpoisonooo/ncnn into support-mha-int8

527b03a

apply code-format changes

aa6e791

improvement(mha): quantize softmax output

bdf52ab

apply code-format changes

1bf72dc

tpoisonooo mentioned this pull request Jun 26, 2022

WIP: ncnn ViT int8 OpenPPL/ppq#154

Open

improvement(benchmark): clean code

9258065

tpoisonooo changed the title ~~WIP: mha int8~~ support multiheadattention int8 Jun 26, 2022

tpoisonooo changed the title ~~support multiheadattention int8~~ WIP: support multiheadattention int8 Jun 26, 2022

tpoisonooo and others added 4 commits June 26, 2022 17:46

docs(operators.md): update mha

6c7d992

revert(src/layer/mha): do not quantize softmax

3f1844b

improvement(test): add mha test

240137b

apply code-format changes

14d45ab

tpoisonooo changed the title ~~WIP: support multiheadattention int8~~ support multiheadattention int8 Jun 29, 2022

tpoisonooo mentioned this pull request Jul 28, 2022

improve vit int8 mha opr #4096

Closed

tpoisonooo and others added 6 commits July 28, 2022 18:40

fix(CI): rebase code

c9f430f

Merge branch 'support-mha-int8' of https://github.com/tpoisonooo/ncnn into support-mha-int8

66ed718

apply code-format changes

435e380

fix(CI): test mha exceeding

497dbd7

fix(src/layer/mha): miss convert weight to int8

5c5a586

apply code-format changes

8c44ccf

EdVince mentioned this pull request Jan 19, 2023

[ARM] Multiheadattention #4463

Merged
