Support activation quantization #2607
Conversation
This pull request was exported from Phabricator. Differential Revision: D70522237
Summary:
X-link: pytorch/benchmark#2607
Pull Request resolved: #148380

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:

# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719
Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87)
Executing actions. Remaining 0/4, 6.1s exec time total
Command: test. Finished 2 local
Time elapsed: 3:22.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
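For intuition, the snippet below is a hand-rolled illustration of the dtype round-trip this pass is intended to apply to activations saved for backward (it is not the Inductor pass itself, and the helper names are made up for the example): cast to torch.float8_e5m2 when an activation is stashed, cast back to the original dtype when it is consumed.

```python
import torch

# Illustration only: quantize an activation to fp8 when it is saved, and cast
# it back when it is consumed. The Inductor pass performs this rewrite
# automatically on the post-grad ATen graph; these helpers are hypothetical.
def quantize_saved_activation(act: torch.Tensor, quant_type=torch.float8_e5m2):
    return act.to(quant_type)

def dequantize_saved_activation(act_fp8: torch.Tensor, orig_dtype=torch.bfloat16):
    return act_fp8.to(orig_dtype)

act = torch.randn(1024, 1024, dtype=torch.bfloat16)
act_fp8 = quantize_saved_activation(act)            # 1 byte/element instead of 2
act_restored = dequantize_saved_activation(act_fp8)
print(act_fp8.dtype, act_restored.dtype)            # torch.float8_e5m2 torch.bfloat16
```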
# E2E

### how to enable

```
post_grad_fusion_options={
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
},
```

See D51860030 for how to set this config under dynamo_config_map.

Note: you can change the quant_type; if none is given, the default type torch.float8_e5m2 is used to quantize.
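For a plain PyTorch setup outside the internal launcher, a minimal sketch of enabling the pass is shown below. It assumes the pass reads its options from torch._inductor.config.post_grad_fusion_options under the same key as above; ToyMLP is a hypothetical model used only to exercise torch.compile.

```python
import torch

# Assumption: the activation-quantization pass is keyed under
# post_grad_fusion_options in torch._inductor.config, mirroring the
# dynamo_config_map entry above. quant_type falls back to
# torch.float8_e5m2 when omitted.
torch._inductor.config.post_grad_fusion_options = {
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
}

class ToyMLP(torch.nn.Module):  # hypothetical model for illustration
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(128, 256)
        self.fc2 = torch.nn.Linear(256, 128)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = torch.compile(ToyMLP().cuda())
loss = model(torch.randn(32, 128, device="cuda")).sum()
loss.backward()  # the pass targets activations saved for this backward
```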
#### If you use FSDP

- You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 for the config setting); a minimal sketch follows this list.
- Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/).
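A sketch of those two adjustments in OSS PyTorch terms, assuming inline_inbuilt_nn_modules is the flag of the same name under torch._dynamo.config and that the guard-skip switch is set as an environment variable as written above:

```python
import os
import torch

# Assumption: the flag lives under torch._dynamo.config with this exact name.
torch._dynamo.config.inline_inbuilt_nn_modules = True

# Ensure the unsafe guard-skipping switch from older recipes is not set.
os.environ.pop("UNSAFE_SKIP_FSDP_MODULE_GUARDS", None)
```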
```
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization]
```

aps-512_8_remove_fsdp_guards-92ae3972ba

tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

baseline w/o fp8 quantization: aps-mengluy_remove_fsdp-ce75b306fa
w/ fp8 quantization: aps-mengluy_remove_fsdp_fp8-96541deec4
### QPS

{F1977040587}

### memory

baseline
{F1977040640}
memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=1767027467197075&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot

with fp8
{F1977040641}
memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=639378375763157&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot

### conclusion
- ~9% QPS improvement; peak memory drops from 82.01 to 78.97.
- For NE, we need a longer verification run; a scaling version is WIP.

Differential Revision: D70522237