Add XLMRoberta in Embedding Train #10074

jie-z-0607 · 2025-03-11T06:24:02Z

Before submitting

Lint code. If there are lint issues, please format the code first.

# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py

Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

New features

PR changes

Models

Description

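I. Add support for the XLMRoberta model in Embedding training, enabling fine-tuning of bge-m3 and models in the same series:
1. Add the relevant models to the XLMRoberta modeling file;
2. Adjust the model selection and initialization code in the training script;
3. Adjust the data construction code in the embedding dataset scripts;
4. Add support in the other argument files, etc.

II. Fix an issue in the original XLMRoberta model where recompute did not work correctly when enabled.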

paddle-bot · 2025-03-11T06:24:08Z

Thanks for your contribution!

ZHUI

LGTM

ZHUI · 2025-03-11T06:32:50Z

paddlenlp/data/data_collator.py

-            b = np.tril(np.ones([cur_len, cur_len]), 0)
-            input_mask_data[0, 0, offset : offset + cur_len, offset : offset + cur_len] = b
+            b = np.ones([cur_len])
+            input_mask_data[0, offset : offset + cur_len] = b


Please check whether the data processing remains compatible.
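To make the compatibility concern concrete, here is a minimal sketch of the two mask layouts implied by the hunk above. The helper names, the `float32` dtype, and the exact array shapes (`[1, 1, total_len, total_len]` versus `[1, total_len]`) are assumptions inferred from the indexing in the diff, not code taken from `data_collator.py`.

```python
import numpy as np


def causal_block_mask(lengths, total_len):
    """Old layout (assumed shape [1, 1, total_len, total_len]):
    one lower-triangular block per packed sequence."""
    mask = np.zeros([1, 1, total_len, total_len], dtype="float32")
    offset = 0
    for cur_len in lengths:
        b = np.tril(np.ones([cur_len, cur_len]), 0)
        mask[0, 0, offset : offset + cur_len, offset : offset + cur_len] = b
        offset += cur_len
    return mask


def padding_mask(lengths, total_len):
    """New layout (assumed shape [1, total_len]):
    1 for real tokens, 0 for padding, no causal structure."""
    mask = np.zeros([1, total_len], dtype="float32")
    offset = 0
    for cur_len in lengths:
        b = np.ones([cur_len])
        mask[0, offset : offset + cur_len] = b
        offset += cur_len
    return mask


old = causal_block_mask([2, 3], total_len=6)
new = padding_mask([2, 3], total_len=6)
print(old.shape, new.shape)
```

The two results differ in rank and in meaning (causal per-sequence blocks versus a flat padding mask), so any consumer that still expects the 4-D causal form would likely need to be checked.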

ZHUI · 2025-03-11T06:33:28Z

llm/run_embedding.py

@@ -248,6 +253,7 @@ def main():
        return_tensors="np",
        return_attention_mask=not model_args.flash_mask,
        pad_to_multiple_of=data_args.pad_to_multiple_of,
+        return_position_ids=False


Please make this configurable through the arguments instead.
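One way this suggestion could look is sketched below. `EmbeddingDataArguments` and `build_collator_kwargs` are hypothetical names used only for illustration; the actual argument dataclass and collator call in `llm/run_embedding.py` may be organized differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EmbeddingDataArguments:
    # Hypothetical stand-in for the script's data arguments.
    pad_to_multiple_of: Optional[int] = field(
        default=None,
        metadata={"help": "Pad sequences to a multiple of this value."},
    )
    return_position_ids: bool = field(
        default=False,
        metadata={"help": "Whether the collator should return position_ids."},
    )


def build_collator_kwargs(data_args, flash_mask):
    """Collect the collator keyword arguments shown in the hunk,
    reading return_position_ids from the arguments instead of hard-coding it."""
    return dict(
        return_tensors="np",
        return_attention_mask=not flash_mask,
        pad_to_multiple_of=data_args.pad_to_multiple_of,
        return_position_ids=data_args.return_position_ids,
    )


kwargs = build_collator_kwargs(EmbeddingDataArguments(), flash_mask=False)
print(kwargs)
```

With a default of `False`, the behavior introduced by this PR is preserved, while models that need position ids can opt in from the command line or a config file.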

ZHUI · 2025-03-11T06:35:12Z

llm/run_embedding.py

+    if isinstance(model_config, XLMRobertaConfig):
+        model_class = XLMRobertaSentenceEmbedding
+    elif isinstance(model_config, Qwen2Config):
+        model_class = Qwen2SentenceEmbedding


Later on, we could consider adding an AutoModelForSentenceEmbedding.
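As a rough illustration of that idea, the `isinstance` chain could be replaced by a small registry keyed on config class. Everything below is a hypothetical sketch: the config and model classes are empty placeholders standing in for the real ones, and `register` and `model_class_for` are invented names, not an existing PaddleNLP API.

```python
# Empty placeholders standing in for the real config and model classes.
class XLMRobertaConfig: ...
class Qwen2Config: ...
class XLMRobertaSentenceEmbedding: ...
class Qwen2SentenceEmbedding: ...


class AutoModelForSentenceEmbedding:
    """Hypothetical dispatcher: maps a config type to its embedding model class."""

    _registry = {}

    @classmethod
    def register(cls, config_class, model_class):
        cls._registry[config_class] = model_class

    @classmethod
    def model_class_for(cls, config):
        # Mirror the isinstance checks from the training script.
        for config_class, model_class in cls._registry.items():
            if isinstance(config, config_class):
                return model_class
        raise ValueError(f"Unsupported config type: {type(config).__name__}")


AutoModelForSentenceEmbedding.register(XLMRobertaConfig, XLMRobertaSentenceEmbedding)
AutoModelForSentenceEmbedding.register(Qwen2Config, Qwen2SentenceEmbedding)

model_class = AutoModelForSentenceEmbedding.model_class_for(XLMRobertaConfig())
print(model_class.__name__)
```

Adding another model family would then only require one more `register` call, leaving the training script untouched.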

add xlmroberta in embedding train

20ca758

paddle-bot bot added the contributor label Mar 11, 2025

paddle-bot bot assigned KB-Ding Mar 11, 2025

ZHUI previously approved these changes Mar 11, 2025

fix_1

21f54e3

jie-z-0607 dismissed ZHUI’s stale review via 21f54e3 March 12, 2025 03:52

fix_2

80ef3a9
