Commit 56b3866

EddyLXJ authored and facebook-github-bot committed
Adding KVZCHEvictionTBEConfig in FBGEMM (#5058)
Summary:
X-link: facebookresearch/FBGEMM#2067
X-link: meta-pytorch/torchrec#3442

Previously, KVZCH used the ID_COUNT and MEM_UTIL eviction trigger modes. Both are hard to tune: model engineers have difficulty deciding what value to use for the ID count or memory utilization threshold. In addition, the eviction start time drifts out of sync across ranks after some time in training, which can cause a large QPS drop during eviction.

This diff adds support for eviction triggered by free memory. Every N batches, each rank checks how much free memory is left; if free memory is below the threshold, eviction is triggered in all TBEs on all ranks using an all-reduce. This forces eviction to start at the same time on every rank.

Differential Revision: D83896528
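The trigger described in the summary can be sketched in plain Python. This is a minimal simulation rather than the FBGEMM implementation: the helper names are hypothetical, the all-reduce is modeled as a `max` over per-rank 0/1 flags (the summary does not say which reduction is used), and the check is assumed to run whenever the batch index is a multiple of N.

```python
from typing import List


def should_check(batch_idx: int, check_interval_batch: int) -> bool:
    """Hypothetical helper: run the free-memory check every N batches."""
    return batch_idx % check_interval_batch == 0


def local_flag(free_mem_gb: float, threshold_gb: int) -> int:
    """Return 1 if this rank's free memory is below the threshold, else 0."""
    return 1 if free_mem_gb < threshold_gb else 0


def simulated_all_reduce_max(flags: List[int]) -> int:
    """Stand-in for a collective all-reduce; assumed here to be a MAX."""
    return max(flags)


def eviction_decision(
    free_mem_per_rank_gb: List[float],
    threshold_gb: int,
    batch_idx: int,
    check_interval_batch: int,
) -> bool:
    """Decide whether every rank should start eviction at this batch."""
    if not should_check(batch_idx, check_interval_batch):
        return False
    flags = [local_flag(m, threshold_gb) for m in free_mem_per_rank_gb]
    # After the reduction every rank sees the same value, so all ranks
    # start eviction together as soon as any one rank runs low.
    return simulated_all_reduce_max(flags) == 1
```

With a 10 GB threshold and a 100-batch interval, `eviction_decision([50.0, 8.0, 40.0], 10, 100, 100)` triggers eviction on all three simulated ranks because one rank is below the threshold, while the same call at batch 99 performs no check at all.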
1 parent 793c4b6 commit 56b3866

1 file changed: +13 -0 lines changed

fbgemm_gpu/fbgemm_gpu/split_table_batched_embeddings_ops_common.py

Lines changed: 13 additions & 0 deletions
@@ -240,6 +240,19 @@ def validate(self) -> None:
         ), "backend_return_whole_row can only be enabled when enable_optimizer_offloading is enabled"
 
 
+class KVZCHEvictionTBEConfig(NamedTuple):
+    # Eviction trigger model for kvzch table: 0: disabled, 1: iteration, 2: mem_util, 3: manual, 4: id count, 5: free_mem
+    kvzch_eviction_trigger_mode: Optional[int] = None
+    # Minimum free memory (in GB) required before triggering eviction when using free_mem trigger mode.
+    eviction_free_mem_threshold_gb: Optional[int] = None
+    # Number of batches between checks for free memory threshold when using free_mem trigger mode.
+    eviction_free_mem_check_interval_batch: Optional[int] = None
+    # The width of each feature score bucket used for threshold calculation in feature score-based eviction.
+    threshold_calculation_bucket_stride: Optional[float] = None
+    # Total number of feature score buckets used for threshold calculation in feature score-based eviction.
+    threshold_calculation_bucket_num: Optional[int] = None
+
+
 class BackendType(enum.IntEnum):
     SSD = 0
     DRAM = 1
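
As an illustration of how the new config might be populated for the free_mem trigger mode, the sketch below redeclares the NamedTuple locally so that it runs without fbgemm_gpu installed. The threshold and interval values are made-up examples, and the `FREE_MEM_TRIGGER_MODE` constant is a hypothetical name for mode 5 as listed in the diff's comment; how the config is passed on to a TBE is not shown in this commit and is left out.

```python
from typing import NamedTuple, Optional


# Local copy of the class added by this commit, for illustration only.
class KVZCHEvictionTBEConfig(NamedTuple):
    kvzch_eviction_trigger_mode: Optional[int] = None
    eviction_free_mem_threshold_gb: Optional[int] = None
    eviction_free_mem_check_interval_batch: Optional[int] = None
    threshold_calculation_bucket_stride: Optional[float] = None
    threshold_calculation_bucket_num: Optional[int] = None


# Hypothetical constant; 5 is the free_mem mode per the comment in the diff.
FREE_MEM_TRIGGER_MODE = 5

config = KVZCHEvictionTBEConfig(
    kvzch_eviction_trigger_mode=FREE_MEM_TRIGGER_MODE,
    eviction_free_mem_threshold_gb=10,           # example value, in GB
    eviction_free_mem_check_interval_batch=100,  # example value, in batches
)

# Fields that are not set keep their None defaults.
unset_fields = [name for name, value in config._asdict().items() if value is None]
```

Because every field defaults to `None`, callers only set the fields relevant to their trigger mode; here the two feature-score bucket fields stay unset.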

0 commit comments