# A Collection of Open-Source SFT Datasets

| Dataset | Size | Lang | Task | Gen | Type | Source | Link |
|---|---|---|---|---|---|---|---|
| [belle\_cn](https://huggingface.co/BelleGroup) | 1079517 | CN | TS/MT | SI | General instructions, math reasoning, dialogue | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/belle_cn) |
| [firefly](https://github.com/yangjianxin1/Firefly) | 1649398 | CN | MT | COL | 23 NLP tasks | Collected Chinese datasets with hand-written instruction templates | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/firefly) |
| [GAOKAO](https://github.com/OpenLMLab/GAOKAO-Bench) | 2785 | CN | MT | COL | Multiple-choice, fill-in-the-blank, and other Gaokao exam questions | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GAOKAO) |
| [COIG](https://huggingface.co/datasets/BAAI/COIG) | 298428 | CN | MT | COL | Exams, translation, human-value alignment instructions, knowledge-graph-based counterfactual dialogue | Automated tools + human verification | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/COIG) |
| [pCLUE](https://github.com/CLUEbenchmark/pCLUE) | 1200705 | CN | MT | | 73 prompts; 9 NLP tasks including classification, inference, keyword recognition, and reading comprehension | | [Download](https://github.com/CLUEbenchmark/pCLUE/tree/main/datasets) |
| [CSL](https://github.com/ydli-ai/CSL) | 396209 | CN | MT | | Metadata of about 400K Chinese papers, 26 prompts | | [Download](https://drive.google.com/file/d/1xEDgtqHU4qm0Sp-dKjc5KerAmWydmh3-/view?usp=sharing) |
| [CNewSum](https://dqwang122.github.io/projects/CNewSum/) | 304307 | CN | TS | | Chinese summarization dataset released by ByteDance and UCSB | | [Download](https://drive.google.com/u/0/uc?id=1A_YcQ3cBAI7u9iVIoCeVLLgwU7UUzHHv&export=download) |
| [Coco-cn](https://github.com/li-xirong/coco-cn) | | CN | TS | | Image-text multimodal | | [Download](https://github.com/li-xirong/coco-cn) |
| [news\_commentary](https://huggingface.co/datasets/news_commentary/viewer/en-zh/train) | 69200 | EN/CN | TS | | Chinese-English translation data | | [Download](https://huggingface.co/datasets/news_commentary/viewer/en-zh/train) |
| [Chain of Thought](https://github.com/google-research/FLAN) | 74771 | EN/CN | MT | HG | CoT-related tasks | Human-annotated CoT on existing datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chain-of-Thought) |
| [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3) | 37175 | EN/CN | TS | MIX | Dialogue evaluation | gpt-3.5 or human | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/HC3) |
| [instinwild](https://github.com/XueFuzhao/InstructionWild) | 52191 | EN/CN | MT | SI | Generation, open-domain QA, brainstorming | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/instinwild) |
| [Alpaca\_GPT4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | 52002 | EN/CN | MT | SI | General instructions | Alpaca data generated by GPT-4 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/alpacaGPT4) |
| [MOSS](https://github.com/OpenLMLab/MOSS) | 1583595 | EN/CN | SI | | | | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/MOSS) |
| [LLMZoo](https://github.com/FreedomIntelligence/LLMZoo) | | ML | | | | | [Download](https://huggingface.co/datasets/FreedomIntelligence/phoenix-sft-data-v1/tree/main) |
| [Guanaco](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | 534610 | ML | MT | SI | Various NLP tasks | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Guanaco) |
| [Natural Instructions](https://github.com/allenai/natural-instructions) | 5040134 | ML | MT | COL | Various NLP tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Natural-Instructions) |
| [xP3](https://huggingface.co/datasets/bigscience/xP3) | 78883588 | ML | MT | COL | Various NLP tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/xP3) |
| [alpaca](https://github.com/tatsu-lab/stanford_alpaca) | 52002 | EN | MT | SI | General instructions | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/alpaca) |
| [GPT4all](https://github.com/nomic-ai/gpt4all) | 806199 | EN | MT | COL | Code, stories, dialogue | Distilled from GPT-3.5-turbo | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPT4all) |
| [GPTeacher](https://github.com/teknium1/GPTeacher) | 29013 | EN | MT | SI | General, role-play, and tool instructions | GPT-4 & toolformer | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPTeacher) |
| [prosocial dialog](https://huggingface.co/datasets/allenai/prosocial-dialog) | 165681 | EN | TS | MIX | Dialogue | Questions rewritten by GPT-3, human-written responses | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/prosocial-dialog) |
| [finance\_en](https://huggingface.co/datasets/gbharti/finance-alpaca) | 68912 | EN | TS | COL | Financial-domain QA | GPT3.5 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/) |
| [instruct](https://huggingface.co/datasets/swype/instruct) | 888969 | EN | MT | COL | Augmentation of GPT4All, Alpaca, and open-source datasets | NLP augmentation tools provided by AllenAI | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/instruct) |
| [Code Alpaca](https://github.com/sahil280114/codealpaca) | 20022 | EN | SI | SI | Code generation, editing, optimization | text-davinci-003 | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/CodeAlpaca) |
| [webGPT](https://huggingface.co/datasets/openai/webgpt_comparisons) | 18994 | EN | TS | MIX | Information-retrieval QA | fine-tuned GPT-3 + human evaluation | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/webGPT) |
| [dolly 2.0](https://github.com/databrickslabs/dolly) | 15015 | EN | TS | HG | Seven task types: open QA, closed QA, information extraction, summarization, brainstorming, classification, and creative writing | Human-annotated | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/dolly) |
| [baize](https://github.com/project-baize/baize-chatbot) | 653699 | EN | MT | COL | Alpaca and various QA tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/baize) |
| [hh-rlhf](https://github.com/anthropics/hh-rlhf) | 284517 | EN | TS | MIX | Dialogue | RLHF models | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/hh-rlhf) |
| [OIG(part)](https://laion.ai/blog/oig-dataset/) | 49237 | EN | MT | COL | Various NLP tasks | Collection and augmentation of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/OIG) |
| [camel](https://github.com/lightaime/camel) | 760620 | EN | MT | SI | Role-play dialogues in physics, biology, chemistry, programming, math, society, and other domains | Generated by gpt-3.5-turbo | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/camel) |
| [FLAN-Muffin](https://huggingface.co/datasets/Muennighoff/flan) | 1764800 | EN | MT | COL | 60 NLP tasks | Collection of human-annotated datasets | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/FLAN-Muffin) |
| [GPT4Tools](https://github.com/StevenGrove/GPT4Tools) | 71446 | EN | MT | SI | A collection of tool-related instructions | gpt-3.5-turbo | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/gpt4tools) |
| [ShareChat](https://huggingface.co/datasets/RyokoAI/ShareGPT52K) | 1663241 | EN | MT | MIX | General instructions | Collected from ShareGPT | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/ShareGPT) |
| [Auto CoT](https://github.com/amazon-science/auto-cot) | | EN | | | | | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Auto-CoT) |
| [ultrachat](https://github.com/thunlp/UltraChat) | 28247446 | EN | | | | | [Download](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/ultrachat) |
| [StackLLaMA](https://huggingface.co/datasets/lvwerra/stack-exchange-paired) | todo | EN | | | | | |
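
To pick datasets by language, the table can be filtered with a few lines of code. The following is a minimal sketch, not part of the original repository: it parses a small excerpt of the table (a handful of rows, with the descriptive and link columns omitted) and sums the sample counts of the rows covering a given language. The helper names `parse_table`, `covers`, and `total_size` are hypothetical, and the sketch assumes that a `Lang` value such as `EN/CN` means the dataset covers both languages.

```python
from typing import Dict, List

# Excerpt of the table above (assumption: only the first five columns are
# needed; the descriptive and link columns are left out for brevity).
TABLE = """\
| Dataset | Size | Lang | Task | Gen |
|---|---|---|---|---|
| belle_cn | 1079517 | CN | TS/MT | SI |
| firefly | 1649398 | CN | MT | COL |
| Coco-cn | | CN | TS | |
| HC3 | 37175 | EN/CN | TS | MIX |
| alpaca | 52002 | EN | MT | SI |
| StackLLaMA | todo | EN | | |
"""


def parse_table(markdown: str) -> List[Dict[str, str]]:
    """Turn a pipe-delimited markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in markdown.strip().splitlines()]
    # Drop the outer pipes, then split the remaining text into cells.
    header = [c.strip() for c in lines[0][1:-1].split("|")]
    rows = []
    for ln in lines[2:]:  # lines[1] is the |---|---| separator row
        cells = [c.strip() for c in ln[1:-1].split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def covers(row: Dict[str, str], lang: str) -> bool:
    """True if the row's Lang cell lists `lang` (e.g. 'EN/CN' covers 'CN')."""
    return lang in row["Lang"].split("/")


def total_size(rows: List[Dict[str, str]]) -> int:
    """Sum the Size column, skipping empty or non-numeric cells like 'todo'."""
    return sum(int(r["Size"]) for r in rows if r["Size"].isdigit())


rows = parse_table(TABLE)
chinese = [r for r in rows if covers(r, "CN")]
print([r["Dataset"] for r in chinese])
print(total_size(chinese))
```

Rows whose size is blank or marked `todo` are skipped when summing, so the total is a lower bound for the selected subset.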