Add Contrastive Learning #10097

jie-z-0607 · 2025-03-12T03:11:12Z

Before submitting

Lint code. If there are lint issues, please format the code first.

# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --files XXXX.py

Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

New features

PR changes

Others

Description

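1. Add the contrastive learning directory and its introduction.
2. Add the data processing and evaluation scripts needed when training an Embedding Model:

query cleaning;
hard negative mining;
model inference and evaluation.

3. Support multi-GPU inference and the faiss library, which effectively improves running speed.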

paddle-bot · 2025-03-12T03:11:16Z

Thanks for your contribution!

ZHUI · 2025-03-12T03:13:52Z

slm/examples/contrastive_learning/clean_query.py

@@ -0,0 +1,238 @@
+from paddlenlp.transformers import AutoModel, AutoConfig


Please install pre-commit and run it to fix the formatting.

ZHUI · 2025-03-12T03:14:20Z

slm/examples/contrastive_learning/embedding_evaluate.py

+import paddle
+import numpy as np
+
+class Embedding_Evaluation:


If some of this code comes from elsewhere, please add a link to the source.

ZHUI · 2025-03-12T03:15:55Z

slm/examples/contrastive_learning/README.md

+neg_passage_path = './toy_data/toy_dev_neg.json'
+eval = Embedding_Evaluation(model_path, tokenizer_path, query_pos_passage_path, neg_passage_path)
+print(eval.evaluate())
+```


If this can be evaluated on MTEB, please add a section describing how to run that evaluation, and include a score.
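The diff does not show which metrics `evaluate()` reports, so the following is only a minimal sketch of two metrics commonly used for this kind of query / positive passage / negative passage setup: mean reciprocal rank (MRR) and recall@k. The function name `mrr_and_recall_at_k` and the toy scores are hypothetical and are not taken from the PR.

```python
import numpy as np


def mrr_and_recall_at_k(pos_scores, neg_scores, k=1):
    """Compute MRR and recall@k for one positive per query.

    pos_scores: shape (num_queries,), similarity of each query to its positive.
    neg_scores: shape (num_queries, num_negatives), similarity to each negative.
    Hypothetical helper for illustration; not the PR's Embedding_Evaluation.
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # Rank of the positive = 1 + number of negatives scoring strictly higher.
    ranks = 1 + (neg > pos[:, None]).sum(axis=1)
    mrr = float(np.mean(1.0 / ranks))
    recall = float(np.mean(ranks <= k))
    return mrr, recall


# Toy scores: the first query ranks its positive first, the second ranks it third.
pos = [0.9, 0.4]
neg = [[0.2, 0.1, 0.3], [0.8, 0.5, 0.1]]
mrr, recall_at_1 = mrr_and_recall_at_k(pos, neg, k=1)
print(mrr, recall_at_1)
```

A score reported this way depends heavily on how the negatives were chosen, which is one reason a standard benchmark such as MTEB is easier to compare across models.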

ZHUI · 2025-03-12T03:16:08Z

slm/examples/contrastive_learning/README.md

+
+## 2. Training
+For the Embedding Model training code, see:
+- [run_embedding.py](../../../../llm/../PaddleNLP_zhangjie/llm/run_embedding.py)


The path in this link is wrong.

ZHUI · 2025-03-12T03:17:07Z

slm/examples/contrastive_learning/README.md

+
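+### 1.1 Query Cleaning
+Data quality is especially important when an Embedding Model is trained with contrastive learning. With multi-GPU inference and the faiss library, the queries in the dataset can be cleaned quickly and efficiently. This removes low-quality data and prevents an excess of near-duplicate queries from interfering with training and degrading the result.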
+


Could you explain the underlying principle here?
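One common way to remove near-duplicate queries, sketched here as an assumption about the approach rather than as the logic in `clean_query.py`, is to embed every query, compare embeddings by cosine similarity, and keep a query only if it is not too similar to one already kept. The sketch uses brute-force NumPy search in place of faiss and multi-GPU inference so that it stays self-contained; the function name `dedup_queries`, the threshold, and the toy vectors are hypothetical.

```python
from typing import List

import numpy as np


def dedup_queries(embeddings: np.ndarray, threshold: float = 0.9) -> List[int]:
    """Return indices of queries to keep, scanning greedily in input order.

    A query is dropped when its cosine similarity to any already-kept
    query is at least `threshold`. Hypothetical helper for illustration.
    """
    # L2-normalise so that the inner product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    kept: List[int] = []
    for i in range(unit.shape[0]):
        if kept:
            sims = unit[kept] @ unit[i]
            if float(sims.max()) >= threshold:
                continue  # near-duplicate of a kept query
        kept.append(i)
    return kept


# Toy embeddings standing in for model outputs; the first two are near-duplicates.
queries = ["how to install paddle", "how do I install paddle", "what is faiss"]
toy = np.array([[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0]])
kept = dedup_queries(toy, threshold=0.95)
print([queries[i] for i in kept])
```

On a large dataset the inner loop would typically be replaced by an approximate or GPU nearest-neighbour index, which is presumably the role faiss plays in the PR.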

ZHUI · 2025-03-12T03:17:19Z

slm/examples/contrastive_learning/README.md

+```
+
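+### 1.2 Negative Sample Mining
+High-quality negative samples make contrastive learning more efficient and speed up model convergence. With multi-GPU inference and the faiss library, negative samples can be mined from the data quickly and efficiently.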


Could you explain the underlying principle here as well?
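Hard negative mining is usually described as follows: for each query, retrieve the passages that the current model scores highest, discard the known positives, and use the remaining top-ranked passages as negatives, since they are the ones the model most easily confuses with a correct answer. The sketch below illustrates that idea with brute-force NumPy search; it is an assumption about the general technique, not the PR's script, and `mine_hard_negatives` and the toy data are hypothetical.

```python
from typing import List, Set

import numpy as np


def mine_hard_negatives(
    query_emb: np.ndarray,
    passage_emb: np.ndarray,
    positive_ids: List[Set[int]],
    top_k: int = 2,
) -> List[List[int]]:
    """For each query, return the top_k most similar non-positive passage ids.

    Hypothetical helper for illustration; faiss would replace the dense
    similarity matrix on a corpus of realistic size.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    scores = q @ p.T  # cosine similarity, shape (num_queries, num_passages)
    negatives: List[List[int]] = []
    for qi, row in enumerate(scores):
        ranked = np.argsort(-row)  # most similar passages first
        hard = [int(pid) for pid in ranked if int(pid) not in positive_ids[qi]]
        negatives.append(hard[:top_k])
    return negatives


# One toy query whose positive is passage 0.
passages = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
query = np.array([[1.0, 0.0]])
print(mine_hard_negatives(query, passages, positive_ids=[{0}], top_k=2))
```

In practice the very top of the ranking can contain unlabeled positives, so some pipelines sample negatives from a rank window or apply a score ceiling instead of taking the first `top_k` directly.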

codecov · 2025-03-12T03:46:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.99%. Comparing base (595e74f) to head (afd7be7).
Report is 169 commits behind head on develop.

❌ Your project check has failed because the head coverage (49.99%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop   #10097      +/-   ##
===========================================
- Coverage    52.11%   49.99%   -2.12%     
===========================================
  Files          730      757      +27     
  Lines       116557   122442    +5885     
===========================================
+ Hits         60744    61219     +475     
- Misses       55813    61223    +5410


add_contrastive_learning

b12ecf0

paddle-bot bot added the contributor label Mar 12, 2025

paddle-bot bot assigned wawltor Mar 12, 2025

ZHUI reviewed Mar 12, 2025

View reviewed changes

jie-z-0607 added 3 commits March 12, 2025 15:21

fix_1

eb2fc6e

fix_2

41725e6

fix_3

afd7be7
