Fix: Correct sample idxmap building in SFT data preprocessing script #636

Open
Dear-Sloth wants to merge 1 commit into alibaba:main from Dear-Sloth:fix-sft-data-build-idx-map

@Dear-Sloth

The original build_idxmap_sft_dataset.py script had two critical issues:

  1. It yielded an empty dictionary for samples that were supposed to be omitted because their input was too long, leaving empty entries in the final idxmap dataset. This could cause downstream training tasks to fail.
  2. The final sample count was misleading because it counted all processed raw samples, not only the ones actually saved.

This commit resolves these issues by:

  • Modifying Encoder.encode to ensure no data is yielded for omitted samples.
  • Implementing accurate counters in Partition.process_json_file to track saved, truncated, and omitted samples.
  • Adding a comprehensive summary report at the end of the process for better transparency.

This ensures the integrity of the created dataset and provides clear, actionable statistics to the user.
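A minimal sketch of the pattern described above, not the actual code from build_idxmap_sft_dataset.py. The function name encode_samples, the whitespace "tokenizer", the max_len limit, the field names, and the omission rule are all assumptions made for illustration. It shows a generator that skips over-long samples instead of yielding an empty dictionary, plus counters for saved, truncated, and omitted samples. Here a truncated sample is also counted as saved, which is another assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def encode_samples(
    samples: Iterable[Dict[str, str]],
    counts: Counter,
    max_len: int = 8,  # hypothetical sequence-length limit, in tokens
) -> Iterator[Dict[str, List[str]]]:
    """Yield one encoded record per kept sample; yield nothing for omitted ones."""
    for sample in samples:
        # Stand-in for a real tokenizer: split on whitespace.
        prompt = sample["prompt"].split()
        answer = sample["answer"].split()

        # Assumed rule: a prompt that already fills the window leaves no room
        # for the answer, so the sample is omitted. The important part is
        # `continue`: no empty dict is yielded for it.
        if len(prompt) >= max_len:
            counts["omitted"] += 1
            continue

        tokens = prompt + answer
        if len(tokens) > max_len:
            tokens = tokens[:max_len]
            counts["truncated"] += 1

        counts["saved"] += 1
        yield {"input_ids": tokens}


def summary(counts: Counter) -> str:
    """Format a short report of what was actually written."""
    total = counts["saved"] + counts["omitted"]
    return (
        f"processed={total} saved={counts['saved']} "
        f"truncated={counts['truncated']} omitted={counts['omitted']}"
    )


demo_samples = [
    {"prompt": "a b", "answer": "c"},                  # fits as-is
    {"prompt": "a b c d e f g h i", "answer": "x"},    # prompt too long: omitted
    {"prompt": "a b c d e", "answer": "v w x y z"},    # too long overall: truncated
]
demo_counts = Counter()
demo_encoded = list(encode_samples(demo_samples, demo_counts))
print(summary(demo_counts))
```

Because omitted samples never reach the consumer, every record in the output is non-empty, and the saved counter matches the number of records written rather than the number of raw samples read.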

The processing results of the updated script are as follows:

[Image: QQ screenshot 20250719160458]

@CLAassistant

CLAassistant commented Jul 19, 2025

CLA assistant check
All committers have signed the CLA.
