Fix: Correct sample idxmap building in SFT data preprocessing script by Dear-Sloth · Pull Request #636 · alibaba/Pai-Megatron-Patch

Dear-Sloth · 2025-07-19T08:07:39Z

The original build_idxmap_sft_dataset.py script had two critical issues:

It yielded an empty dictionary for samples that were supposed to be omitted because their input was too long, which left empty entries in the final idxmap dataset. These empty entries could cause downstream training tasks to fail.
The final sample count was misleading because it counted every raw sample processed, not only the samples actually saved.
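
The failure mode above can be illustrated with a minimal sketch. It assumes a generator-style encoder; `encode_buggy`, `MAX_LEN`, and the whitespace tokenizer are hypothetical stand-ins for illustration, not the script's actual code.

```python
MAX_LEN = 4  # hypothetical sequence-length limit


def encode_buggy(samples):
    """Yield one dict per raw sample, including an empty dict for over-long ones."""
    for text in samples:
        tokens = text.split()  # stand-in for real tokenization
        if len(tokens) > MAX_LEN:
            yield {}  # over-long sample: meant to be omitted, but still yielded
        else:
            yield {"input_ids": tokens}


raw = ["a b c", "a b c d e f", "x y"]
written = list(encode_buggy(raw))

# One of the written entries is empty, yet the reported count covers all raw samples.
empty_entries = sum(1 for doc in written if not doc)
reported_count = len(raw)
print(f"written={len(written)} empty={empty_entries} reported={reported_count}")
```

In this sketch the consumer has no way to tell an omitted sample from a real one unless it checks every entry, which is how empty records can reach the dataset.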

This commit resolves these issues by:

Modifying Encoder.encode to ensure no data is yielded for omitted samples.
Implementing accurate counters in Partition.process_json_file to track saved, truncated, and omitted samples.
Adding a comprehensive summary report at the end of the process for better transparency.

This ensures the integrity of the created dataset and provides clear, actionable statistics to the user.
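
The approach described above could look roughly like the following sketch. All names here (`encode`, `MAX_LEN`, the `truncate` flag, the `stats` counter keys) are assumptions made for illustration and do not reproduce the actual `Encoder.encode` or `Partition.process_json_file` code; in particular, counting truncated samples as saved is a choice made for this example only.

```python
from collections import Counter

MAX_LEN = 4  # hypothetical sequence-length limit


def encode(samples, stats, truncate=False):
    """Yield only samples that should be saved, updating counters as it goes."""
    for text in samples:
        tokens = text.split()  # stand-in for real tokenization
        if len(tokens) > MAX_LEN:
            if truncate:
                tokens = tokens[:MAX_LEN]
                stats["truncated"] += 1
            else:
                stats["omitted"] += 1
                continue  # yield nothing for an omitted sample
        stats["saved"] += 1
        yield {"input_ids": tokens}


stats = Counter()
raw = ["a b c", "a b c d e f", "x y"]
saved_docs = list(encode(raw, stats))

# Summary built from what was actually saved, not from the raw sample count.
print(
    f"total={len(raw)} saved={stats['saved']} "
    f"truncated={stats['truncated']} omitted={stats['omitted']}"
)
```

Because the generator skips omitted samples with `continue`, the consumer never sees an empty entry, and the summary distinguishes raw input size from the number of records written.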

The processing results of the updated script are as follows:

CLAassistant · 2025-07-19T08:07:46Z

All committers have signed the CLA.

Fix: Correct sample idxmap building in SFT preprocessing script

bf69ab4
