Support activation quantization #2607

Open

wants to merge 1 commit into base: main

Conversation

mengluy0125 (Contributor)

Summary:
X-link: pytorch/pytorch#148380

We enable activation quantization in the forward pass, and users can customize the dtype used for quantization.
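
For intuition, here is a minimal sketch of the dtype round trip the pass performs on saved activations (an illustration only, not the pass implementation; the shape and float32 source dtype are made up):

```
import torch

# Illustration only: store an activation in torch.float8_e5m2 and cast it back
# to the original dtype when a consumer (e.g. the backward pass) needs it.
x = torch.randn(4, 8)                 # a saved activation (made-up shape)
x_q = x.to(torch.float8_e5m2)         # quantized copy, 1 byte per element
x_dq = x_q.to(torch.float32)          # dequantized, lossy relative to x
```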

Differential Revision: D70522237

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70522237

pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Apr 22, 2025
…ng (#148380)

Summary:
X-link: pytorch/benchmark#2607

Pull Request resolved: #148380

We enable activation quantization in the forward pass, and users can customize the dtype used for quantization.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719
Network: Up: 62KiB  Down: 81KiB  (reSessionID-913ca82d-c395-4492-818e-6e004df37f87)
Executing actions. Remaining     0/4                                                                                                   6.1s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:22.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to enable

```
post_grad_fusion_options={
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
},
```
See D51860030 for how to set this config under dynamo_config_map.

Note: you can change quant_type; if none is given, the default torch.float8_e5m2 is used for quantization.
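
For reference, a hedged open-source sketch of the same setting (assuming the torch._inductor.config.post_grad_fusion_options dict and torch.compile rather than the internal dynamo_config_map path; the model below is a placeholder, and the pass only takes effect on a PyTorch build that includes it):

```
import torch
import torch._inductor.config as inductor_config

# Register the post-grad pass; quant_type falls back to torch.float8_e5m2 if omitted.
inductor_config.post_grad_fusion_options = {
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
}

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())  # placeholder model
compiled = torch.compile(model)
loss = compiled(torch.randn(2, 16)).sum()
loss.backward()  # activations saved for backward are the ones the pass quantizes
```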

#### If you use FSDP

- You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 for the config setting); a sketch of both settings follows this list.
- Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1
  (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/)
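
A hedged sketch of both adjustments (assuming the torch._dynamo.config flag and that UNSAFE_SKIP_FSDP_MODULE_GUARDS is read from the environment):

```
import os

import torch._dynamo.config as dynamo_config

# Let Dynamo inline built-in nn.Module methods, which FSDP-wrapped models may need
# for this pass to apply (see D70023488 for the corresponding job config).
dynamo_config.inline_inbuilt_nn_modules = True

# Make sure FSDP module guards are NOT skipped for this run.
os.environ.pop("UNSAFE_SKIP_FSDP_MODULE_GUARDS", None)
```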

```
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization]
```

aps-512_8_remove_fsdp_guards-92ae3972ba

tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

baseline w/o fp8 quantization: aps-mengluy_remove_fsdp-ce75b306fa

w/ fp8 quantization: aps-mengluy_remove_fsdp_fp8-96541deec4

### QPS

 {F1977040587}

### memory

baseline
{F1977040640}

memory snapshot:
https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=1767027467197075&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot

with fp8
{F1977040641}

memory snapshot:
https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=639378375763157&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot

### conclusion:

- ~9% QPS improvement; peak memory reduced from 82.01 to 78.97.

- For NE, we need a longer verification run; a scaling version is WIP.

Differential Revision: D70522237