
Support ssd device propagation in Torch Rec for RecSys Inference #2961


Closed
wants to merge 1 commit into from


faran928
Contributor

@faran928 faran928 commented May 8, 2025

Summary:
For RecSys inference when embedding tables are offloaded to SSD:

  1. Specify which tables are offloaded to SSD and propagate that choice through TorchRec via FUSED_PARAMS.
  2. Keep torch.device("cpu") as the compute device, but create a new device group for SSD so that SSD tables use a separate input / output dist. The separation is needed because the SSD kernel (EmbeddingDB) differs from the CPU kernel.

device_type_from_sharding_info is to be renamed to storage_device_type_from_sharding_info to make its meaning clearer.
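
The two steps above can be pictured with a small, self-contained sketch. It is not the PR's code and does not call any TorchRec API: `TableInfo`, `USE_SSD_KEY`, `storage_device_type`, and `group_by_storage_device` are hypothetical names introduced only for illustration, and the actual FUSED_PARAMS key used by the change is not shown in this conversation. The sketch assumes that a per-table flag in the fused params marks a table as SSD-backed, and that tables are then grouped by storage device while the compute device stays "cpu".

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical key: the real FUSED_PARAMS key used by the PR is an assumption here.
USE_SSD_KEY = "use_ssd"


@dataclass
class TableInfo:
    """Illustrative stand-in for per-table sharding info (not a TorchRec class)."""

    name: str
    compute_device: str = "cpu"  # compute stays on CPU even for SSD tables
    fused_params: Dict[str, Any] = field(default_factory=dict)


def storage_device_type(table: TableInfo) -> str:
    """Return the storage device type used to choose a device group."""
    if table.fused_params.get(USE_SSD_KEY, False):
        return "ssd"
    return table.compute_device


def group_by_storage_device(tables: List[TableInfo]) -> Dict[str, List[str]]:
    """Group table names by storage device so each group can get its own dist."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for table in tables:
        groups[storage_device_type(table)].append(table.name)
    return dict(groups)


tables = [
    TableInfo("user_id"),
    TableInfo("item_id", fused_params={USE_SSD_KEY: True}),
    TableInfo("category"),
]
groups = group_by_storage_device(tables)
print(groups)  # {'cpu': ['user_id', 'category'], 'ssd': ['item_id']}
```

In this sketch every table keeps "cpu" as its compute device; only the grouping key changes, which mirrors the idea of a separate device group for SSD without a separate compute device.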

Differential Revision: D74378974

@facebook-github-bot facebook-github-bot added the CLA Signed label May 8, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D74378974

@faran928 faran928 force-pushed the export-D74378974 branch from abbc3e4 to 0559aff Compare May 9, 2025 17:03

faran928 added a commit to faran928/torchrec that referenced this pull request May 12, 2025
…orch#2961)

Summary:

For RecSys inference when embedding tables are offloaded to SSD:

1. Specify which tables are offloaded to SSD and propagate that choice through TorchRec via FUSED_PARAMS, as discussed with TroyGarden.
2. Keep torch.device("cpu") as the compute device, but create a new device group for SSD so that SSD tables use a separate input / output dist. The separation is needed because the in-house SSD TBE kernel, which is based on EmbeddingDB, differs from the CPU TBE kernel.

device_type_from_sharding_info is to be renamed to storage_device_type_from_sharding_info to make its meaning clearer.

Differential Revision: D74378974
@faran928 faran928 force-pushed the export-D74378974 branch from 0559aff to 0274c28 Compare May 12, 2025 06:14

faran928 added a commit to faran928/torchrec that referenced this pull request May 12, 2025
…orch#2961)

Summary:
Pull Request resolved: pytorch#2961

Differential Revision: D74378974
@faran928 faran928 force-pushed the export-D74378974 branch from 0274c28 to 41cbba8 Compare May 12, 2025 06:19
…orch#2961)

Reviewed By: jiayisuse

Differential Revision: D74378974
@faran928 faran928 force-pushed the export-D74378974 branch from 41cbba8 to 94c4638 Compare May 12, 2025 17:20

Labels
CLA Signed, fb-exported

3 participants