Add Contrastive Learning #10097

Open
wants to merge 4 commits into
base: develop

jie-z-0607
Contributor

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

New features

PR changes

Others

Description

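1. Add the contrastive learning directory and introduction;
2. Add the data processing and evaluation scripts needed when training an Embedding Model:

  • query cleaning;
  • hard negative mining;
  • model inference and evaluation.

3. Support multi-GPU inference and the faiss library, which effectively speeds up execution.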

paddle-bot bot commented Mar 12, 2025

Thanks for your contribution!

@@ -0,0 +1,238 @@
from paddlenlp.transformers import AutoModel, AutoConfig
Collaborator

Please install pre-commit and run it to fix the formatting.

import paddle
import numpy as np

class Embedding_Evaluation:
Collaborator

If some of this code comes from elsewhere, please add a link to the source.

neg_passage_path = './toy_data/toy_dev_neg.json'
eval = Embedding_Evaluation(model_path, tokenizer_path, query_pos_passage_path, neg_passage_path)
print(eval.evaluate())
```
Collaborator

If this can be used to evaluate on MTEB, please add a section on how to run that evaluation, and include a score.


## 2. Training
For the Embedding Model training code, see:
- [run_embedding.py](../../../../llm/../PaddleNLP_zhangjie/llm/run_embedding.py)
Collaborator

The directory path is wrong.


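### 1.1 Query Cleaning
Data quality is especially important when training an Embedding Model with contrastive learning. Using multi-GPU inference and the faiss library, the queries in the data can be cleaned quickly and efficiently. This removes low-quality data and prevents an excess of near-duplicate queries from interfering with training and degrading the results.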

Collaborator

Could you explain how this works?
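The README text does not spell out the mechanism, so the following is only a sketch of one common way to clean near-duplicate queries, not necessarily what this PR implements. It assumes each query already has an embedding, keeps a query only if its cosine similarity to every previously kept query is below a threshold, and uses a brute-force comparison where a faiss index search would be used at scale. The function names, the toy embeddings, and the 0.9 threshold are all hypothetical.

```python
import math


def unit_vector(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dedup_queries(queries, embeddings, threshold=0.9):
    """Greedily keep queries whose similarity to all kept queries is below threshold.

    Hypothetical helper: brute-force O(n^2) comparison for illustration only.
    """
    kept_queries = []
    kept_vectors = []
    for query, vec in zip(queries, embeddings):
        unit = unit_vector(vec)
        if all(dot(unit, kept) < threshold for kept in kept_vectors):
            kept_queries.append(query)
            kept_vectors.append(unit)
    return kept_queries


# Toy data (made up): the first two queries point in almost the same direction.
toy_queries = ["how to install paddle", "how do I install paddle", "what is faiss"]
toy_embeddings = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
cleaned = dedup_queries(toy_queries, toy_embeddings, threshold=0.9)
print(cleaned)
```

In this toy case the second query is dropped because it is nearly parallel to the first, while the third is kept. The threshold controls how aggressive the cleaning is and would need tuning on real data.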

```

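### 1.2 Negative Sample Mining
High-quality negative samples make contrastive learning more efficient and speed up model convergence. With multi-GPU inference and the faiss library, negative samples can be mined from the data quickly and efficiently.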
Collaborator

Could you explain how this works?
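As with query cleaning, the README does not describe the mechanism. A minimal sketch of the usual idea behind hard negative mining, offered as an assumption rather than as this PR's implementation: embed queries and passages, rank all passages by similarity to each query, discard the known positives, and take the top-ranked remainder as hard negatives. The brute-force ranking below stands in for a faiss nearest-neighbour search; `mine_hard_negatives`, its parameters, and the toy vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(query_vecs, passage_vecs, positive_ids, top_k=2):
    """Return, per query, the indices of the top_k most similar non-positive passages.

    Hypothetical helper: positive_ids[i] is the set of passage indices
    labelled as relevant for query i.
    """
    all_negatives = []
    for query, positives in zip(query_vecs, positive_ids):
        # Rank every passage by inner-product similarity, highest first.
        ranked = sorted(
            range(len(passage_vecs)),
            key=lambda j: dot(query, passage_vecs[j]),
            reverse=True,
        )
        # Passages that score high but are not labelled positive are "hard".
        negatives = [j for j in ranked if j not in positives][:top_k]
        all_negatives.append(negatives)
    return all_negatives


# Toy data (made up): one query, four passages, passage 0 is the positive.
toy_query_vecs = [[1.0, 0.0]]
toy_passage_vecs = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.5, 0.5]]
toy_positive_ids = [{0}]
hard_negatives = mine_hard_negatives(toy_query_vecs, toy_passage_vecs, toy_positive_ids)
print(hard_negatives)
```

Passages that rank close to the positive may in fact be unlabelled positives, so real pipelines often skip the very top ranks or filter by a score margin; that refinement is omitted here.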

codecov bot commented Mar 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.99%. Comparing base (595e74f) to head (afd7be7).
Report is 169 commits behind head on develop.

❌ Your project check has failed because the head coverage (49.99%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10097      +/-   ##
===========================================
- Coverage    52.11%   49.99%   -2.12%     
===========================================
  Files          730      757      +27     
  Lines       116557   122442    +5885     
===========================================
+ Hits         60744    61219     +475     
- Misses       55813    61223    +5410     

3 participants