Add XLMRoberta in Embedding Train #10074

Open
wants to merge 3 commits into
base: develop

jie-z-0607
Contributor

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases into tests folder. If there are codecov issues, please add test cases first.

PR types

New features

PR changes

Models

Description

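1. Add XLMRoberta support to Embedding training, enabling fine-tuning of bge-m3 and models in the same family:
   a. Add the relevant model classes to the XLMRoberta modeling file;
   b. Adjust the model selection and initialization code in the training script;
   c. Adjust the data construction code in the embedding dataset scripts;
   d. Add support in other parameter files, etc.

2. Fix an issue where recompute did not work correctly when enabled on the original XLMRoberta model.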


paddle-bot bot commented Mar 11, 2025

Thanks for your contribution!

ZHUI
ZHUI previously approved these changes Mar 11, 2025
Collaborator

@ZHUI ZHUI left a comment


LGTM

b = np.tril(np.ones([cur_len, cur_len]), 0)
input_mask_data[0, 0, offset : offset + cur_len, offset : offset + cur_len] = b
b = np.ones([cur_len])
input_mask_data[0, offset : offset + cur_len] = b
Collaborator


Please check whether the data processing remains compatible.
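
For reference, the two mask layouts in the snippet above differ in both shape and meaning: one is a 4-D lower-triangular block per sequence, the other a 2-D all-ones span per sequence. The following is a minimal sketch of that difference, assuming `offset` and `cur_len` track sequences packed back to back into one row. `build_masks` and `max_len` are hypothetical names used only for illustration and are not part of the PR.

```python
import numpy as np


def build_masks(seq_lens, max_len):
    """Build both mask layouts for sequences packed into a single row.

    Hypothetical helper: it only mirrors the four assignment lines in the diff.
    """
    # 4-D layout: [batch, head, query, key], lower-triangular inside each sequence.
    tril_mask = np.zeros([1, 1, max_len, max_len], dtype="float32")
    # 2-D layout: [batch, position], 1 for real tokens and 0 for padding.
    ones_mask = np.zeros([1, max_len], dtype="float32")

    offset = 0
    for cur_len in seq_lens:
        b = np.tril(np.ones([cur_len, cur_len]), 0)
        tril_mask[0, 0, offset : offset + cur_len, offset : offset + cur_len] = b
        b = np.ones([cur_len])
        ones_mask[0, offset : offset + cur_len] = b
        offset += cur_len
    return tril_mask, ones_mask


tril_mask, ones_mask = build_masks([2, 3], max_len=6)
print(tril_mask.shape, ones_mask.shape)
```

Any downstream code that indexes the mask by four dimensions would need to be checked against the 2-D layout, which is presumably the compatibility concern raised here.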

@@ -248,6 +253,7 @@ def main():
return_tensors="np",
return_attention_mask=not model_args.flash_mask,
pad_to_multiple_of=data_args.pad_to_multiple_of,
return_position_ids=False
Collaborator


Please make this configurable through the arguments instead of hard-coding it.
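
One possible shape for this, sketched under assumptions: the `DataArguments` dataclass below is a stand-in for the script's real argument class, the `return_position_ids` field is hypothetical, and `collator_kwargs` is an illustrative helper that only assembles the keyword arguments shown in the diff.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataArguments:
    # Stand-in for the training script's data arguments (assumption).
    pad_to_multiple_of: Optional[int] = None
    # Hypothetical new field replacing the hard-coded False in the diff.
    return_position_ids: bool = field(
        default=False,
        metadata={"help": "Whether the collator should return position ids."},
    )


def collator_kwargs(data_args, flash_mask):
    """Assemble the collator keyword arguments from the parsed arguments."""
    return {
        "return_tensors": "np",
        "return_attention_mask": not flash_mask,
        "pad_to_multiple_of": data_args.pad_to_multiple_of,
        "return_position_ids": data_args.return_position_ids,
    }


default_kwargs = collator_kwargs(DataArguments(), flash_mask=False)
custom_kwargs = collator_kwargs(DataArguments(return_position_ids=True), flash_mask=True)
print(default_kwargs["return_position_ids"], custom_kwargs["return_position_ids"])
```

Keeping the default at `False` would preserve the behavior introduced in this PR while letting other models opt in.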

Comment on lines +111 to +114
if isinstance(model_config, XLMRobertaConfig):
model_class = XLMRobertaSentenceEmbedding
elif isinstance(model_config, Qwen2Config):
model_class = Qwen2SentenceEmbedding
Collaborator


As a follow-up, consider adding an AutoModelForSentenceEmbedding.
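
A rough sketch of what such a class could look like, replacing the `isinstance` chain with a registry lookup. Everything here is an assumption for illustration: the config and model classes are empty stand-ins for the real ones, and the `register` / `from_config` methods are hypothetical, not an existing API.

```python
class XLMRobertaConfig:
    """Stand-in for the real config class."""


class Qwen2Config:
    """Stand-in for the real config class."""


class XLMRobertaSentenceEmbedding:
    def __init__(self, config):
        self.config = config


class Qwen2SentenceEmbedding:
    def __init__(self, config):
        self.config = config


class AutoModelForSentenceEmbedding:
    """Hypothetical dispatcher mapping a config type to its embedding model."""

    _registry = {}

    @classmethod
    def register(cls, config_class, model_class):
        cls._registry[config_class] = model_class

    @classmethod
    def from_config(cls, config):
        # Walk the registry in insertion order and pick the first matching type.
        for config_class, model_class in cls._registry.items():
            if isinstance(config, config_class):
                return model_class(config)
        raise ValueError(f"Unsupported config type: {type(config).__name__}")


AutoModelForSentenceEmbedding.register(XLMRobertaConfig, XLMRobertaSentenceEmbedding)
AutoModelForSentenceEmbedding.register(Qwen2Config, Qwen2SentenceEmbedding)

model = AutoModelForSentenceEmbedding.from_config(XLMRobertaConfig())
print(type(model).__name__)
```

With this layout, supporting another architecture would only require one more `register` call rather than another branch in the training script.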

3 participants