Add Contrastive Learning #10097
base: develop
Thanks for your contribution!
@@ -0,0 +1,238 @@
from paddlenlp.transformers import AutoModel, AutoConfig
Please install pre-commit and run it to fix the formatting.
import paddle
import numpy as np


class Embedding_Evaluation:
If some of this code comes from elsewhere, please add a link to the source.
neg_passage_path = './toy_data/toy_dev_neg.json'
eval = Embedding_Evaluation(model_path, tokenizer_path, query_pos_passage_path, neg_passage_path)
print(eval.evaluate())
```
If this can be evaluated on MTEB, please add a section on how to run that evaluation, and include a score.

## 2. Training
For the location of the Embedding Model training code, see:
- [run_embedding.py](../../../../llm/../PaddleNLP_zhangjie/llm/run_embedding.py)
The directory path is wrong.
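
### 1.1 Query Cleaning
When an Embedding Model is trained with contrastive learning, data quality is especially important. Using multi-GPU inference and the faiss library, the queries in the data can be cleaned quickly and efficiently. This removes low-quality data and keeps an excess of near-duplicate queries from interfering with training and degrading the results.
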
Could you explain how this works?
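One plausible reading of the principle behind query cleaning, offered as a sketch rather than as what the PR's script does: embed every query, then greedily drop any query whose cosine similarity to an already-kept query exceeds a threshold. The function name `dedup_queries` and the `threshold` value are hypothetical. The brute-force NumPy product stands in for a faiss inner-product index such as `faiss.IndexFlatIP`, which would serve the same role at scale.

```python
import numpy as np


def dedup_queries(embeddings, threshold=0.95):
    """Return indices of queries to keep (hypothetical helper, not from the PR).

    A query is kept only if its cosine similarity to every previously
    kept query is below `threshold` (an assumed, tunable value).
    """
    emb = np.asarray(embeddings, dtype=np.float32)
    # L2-normalize so that the inner product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    kept = []
    for i in range(len(emb)):
        # Compare against the queries kept so far; a faiss index search
        # would replace this dense product on large datasets.
        if not kept or np.max(emb[kept] @ emb[i]) < threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
    print(dedup_queries(toy))  # the second query is a near-duplicate of the first
```

The greedy order means the first occurrence of each cluster of similar queries survives; whether the PR keeps the first, the longest, or some other representative is not stated in the README.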
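```


### 1.2 Negative Sample Mining
High-quality negative samples improve the efficiency of contrastive learning and speed up model convergence. Using multi-GPU inference and the faiss library, negative samples can be mined from the data quickly and efficiently.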
Could you explain how this works?
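A common way to mine negatives, sketched here under assumptions and not confirmed to be the PR's method: embed queries and passages, retrieve the top-k most similar passages per query, and treat the retrieved passages that are not the labeled positive as hard negatives. The names `mine_hard_negatives`, `top_k`, and `num_neg` are hypothetical, and the NumPy similarity matrix stands in for a faiss index search.

```python
import numpy as np


def mine_hard_negatives(query_emb, passage_emb, positive_ids, top_k=10, num_neg=2):
    """Return, per query, up to `num_neg` passage indices to use as negatives.

    Hypothetical helper: `positive_ids[i]` is the index of the labeled
    positive passage for query i (one positive per query is assumed).
    """
    q = np.asarray(query_emb, dtype=np.float32)
    p = np.asarray(passage_emb, dtype=np.float32)
    # Normalize so the inner product is cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)

    scores = q @ p.T  # shape: (num_queries, num_passages)
    negatives = []
    for qi, row in enumerate(scores):
        # Highest-scoring passages first, truncated to the top_k candidates.
        ranked = np.argsort(-row)[:top_k]
        # Skip the labeled positive; what remains are similar-but-wrong passages.
        negs = [int(pi) for pi in ranked if pi != positive_ids[qi]][:num_neg]
        negatives.append(negs)
    return negatives


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    passages = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(mine_hard_negatives(queries, passages, positive_ids=[0, 2], top_k=3, num_neg=1))
```

Taking the very top-ranked non-positives risks picking unlabeled true positives (false negatives); some pipelines sample from a lower rank window instead, which would be a small change to the slicing above.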
Codecov Report
All modified and coverable lines are covered by tests ✅
❌ Your project check has failed because the head coverage (49.99%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10097      +/-   ##
===========================================
- Coverage    52.11%   49.99%   -2.12%
===========================================
  Files          730      757      +27
  Lines       116557   122442    +5885
===========================================
+ Hits         60744    61219     +475
- Misses       55813    61223    +5410
☔ View full report in Codecov by Sentry.
Before submitting
Add test cases into the `tests` folder. If there are codecov issues, please add test cases first.
PR types
New features
PR changes
Others
Description
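1. Add a contrastive learning directory and introduction;
2. Add the data processing and evaluation scripts needed for Embedding Model training;
3. Add support for multi-GPU inference and the faiss library, which effectively improves running speed.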