diff --git a/end-to-end-use-cases/NotebookLlama/README.md b/end-to-end-use-cases/NotebookLlama/README.md index 12019745e..d46ce36be 100644 --- a/end-to-end-use-cases/NotebookLlama/README.md +++ b/end-to-end-use-cases/NotebookLlama/README.md @@ -1,4 +1,6 @@ -## NotebookLlama: An Open Source version of NotebookLM +## NotebookLlama: PDF to Podcast using Llama models + +> Note: We have updated this to support the Llama API; sign up [here](http://llama.com). ![NotebookLlama](./resources/Outline.jpg) @@ -15,13 +17,13 @@ It assumes zero knowledge of LLMs, prompting and audio models, everything is cov Here is step by step thought (pun intended) for the task: - Step 1: Pre-process PDF: Use `Llama-3.2-1B-Instruct` to pre-process the PDF and save it in a `.txt` file. -- Step 2: Transcript Writer: Use `Llama-3.1-70B-Instruct` model to write a podcast transcript from the text -- Step 3: Dramatic Re-Writer: Use `Llama-3.1-8B-Instruct` model to make the transcript more dramatic +- Step 2: Transcript Writer: Use `Llama-4-Maverick` model to write a podcast transcript from the text +- Step 3: Dramatic Re-Writer: Use `Llama-3-8B-Instruct` model to make the transcript more dramatic - Step 4: Text-To-Speech Workflow: Use `parler-tts/parler-tts-mini-v1` and `bark/suno` to generate a conversational podcast Note 1: In Step 1, we prompt the 1B model to not modify the text or summarize it, strictly clean up extra characters or garbage characters that might get picked due to encoding from PDF. Please see the prompt in Notebook 1 for more details. -Note 2: For Step 2, you can also use `Llama-3.1-8B-Instruct` model, we recommend experimenting and trying if you see any differences. The 70B model was used here because it gave slightly more creative podcast transcripts for the tested examples. +Note 2: For Step 2, you can also use the `Llama-3-8B-Instruct` model; we recommend experimenting to see whether you notice any differences.
The larger model was used here because it gave slightly more creative podcast transcripts for the tested examples. Note 3: For Step 4, please try to extend the approach with other models. These models were chosen based on a sample prompt and worked best, newer models might sound better. Please see [Notes](./TTS_Notes.md) for some of the sample tests. diff --git a/end-to-end-use-cases/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb b/end-to-end-use-cases/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb index 2cf5d38d3..c3f32f1ec 100644 --- a/end-to-end-use-cases/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb +++ b/end-to-end-use-cases/NotebookLlama/Step-1 PDF-Pre-Processing-Logic.ipynb @@ -17,7 +17,7 @@ "\n", "The first step in getting to the podcast is finding a script, right now our logic is:\n", "- Use any PDF on any topic\n", - "- Prompt `Llama-3.2-1B-Instruct` model to process it into a text file\n", + "- Prompt `Llama-3.2-1B-Instruct` (or the Llama 8B model if you're using the API; just uncomment it) to process it into a text file\n", "- Re-write this into a podcast transcript in next notebook.\n", "\n", "In this notebook, we will upload a PDF and save it into a `.txt` file using the `PyPDF2` library, later we will process chunks from the text file using our featherlight model."
@@ -33,13 +33,14 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 1, "id": "f4fc7aef-3505-482e-a998-790b8b9d48e4", "metadata": {}, "outputs": [], "source": [ - "#!pip install PyPDF2\n", - "#!pip install rich ipywidgets" + "# !pip install PyPDF2\n", + "# !pip install rich ipywidgets\n", + "# !pip install llama-api-client" ] }, { @@ -54,18 +55,18 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 2, "id": "60d0061b-8b8c-4353-850f-f19466a0ae2d", "metadata": {}, "outputs": [], "source": [ - "pdf_path = './resources/2402.13116v4.pdf'\n", + "pdf_path = './resources/2407.21783v3.pdf'\n", "DEFAULT_MODEL = \"meta-llama/Llama-3.2-1B-Instruct\"" ] }, { "cell_type": "code", - "execution_count": 49, + "execution_count": 15, "id": "21029232-ac5f-42ca-b26b-baad5b2f49b7", "metadata": {}, "outputs": [], @@ -73,13 +74,15 @@ "import PyPDF2\n", "from typing import Optional\n", "import os\n", - "import torch\n", - "from accelerate import Accelerator\n", - "from transformers import AutoModelForCausalLM, AutoTokenizer\n", - "\n", + "from llama_api_client import LlamaAPIClient\n", "from tqdm.notebook import tqdm\n", "import warnings\n", "\n", + "# Initialize the Llama API client\n", + "# Replace the placeholder below with your own key; do not commit real keys\n", + "os.environ[\"LLAMA_API_KEY\"] = \"api-key\"\n", + "client = LlamaAPIClient()\n", + "\n", "warnings.filterwarnings('ignore')" ] }, @@ -93,7 +96,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 4, "id": "153d9ece-37a4-4fff-a8e8-53f923a2b0a0", "metadata": {}, "outputs": [], @@ -120,12 +123,12 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 5, "id": "b57c2d64-3d75-4aeb-b4ee-bd1661286b66", "metadata": {}, "outputs": [], "source": [ - "def extract_text_from_pdf(file_path: str, max_chars: int = 100000) -> Optional[str]:\n", + "def extract_text_from_pdf(file_path: str, max_chars: int = 100000):\n", " if not validate_pdf(file_path):\n", " return None\n", " \n", @@ -181,13 +184,13 @@ }, {
"cell_type": "code", - "execution_count": 11, + "execution_count": 6, "id": "0984bb1e-d52c-4cec-a131-67a48061fabc", "metadata": {}, "outputs": [], "source": [ "# Get PDF metadata\n", - "def get_pdf_metadata(file_path: str) -> Optional[dict]:\n", + "def get_pdf_metadata(file_path: str):\n", " if not validate_pdf(file_path):\n", " return None\n", " \n", @@ -214,7 +217,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 7, "id": "63848943-79cc-4e21-8396-6eab5df493e0", "metadata": {}, "outputs": [ @@ -225,13 +228,13 @@ "Extracting metadata...\n", "\n", "PDF Metadata:\n", - "Number of pages: 44\n", + "Number of pages: 92\n", "Document info:\n", "/Author: \n", - "/CreationDate: D:20240311015030Z\n", + "/CreationDate: D:20241126014049Z\n", "/Creator: LaTeX with hyperref\n", "/Keywords: \n", - "/ModDate: D:20240311015030Z\n", + "/ModDate: D:20241126014049Z\n", "/PTEX.Fullbanner: This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5\n", "/Producer: pdfTeX-1.40.25\n", "/Subject: \n", @@ -239,44 +242,51 @@ "/Trapped: /False\n", "\n", "Extracting text...\n", - "Processing PDF with 44 pages...\n", - "Processed page 1/44\n", - "Processed page 2/44\n", - "Processed page 3/44\n", - "Processed page 4/44\n", - "Processed page 5/44\n", - "Processed page 6/44\n", - "Processed page 7/44\n", - "Processed page 8/44\n", - "Processed page 9/44\n", - "Processed page 10/44\n", - "Processed page 11/44\n", - "Processed page 12/44\n", - "Processed page 13/44\n", - "Processed page 14/44\n", - "Processed page 15/44\n", - "Processed page 16/44\n", - "Reached 100000 character limit at page 17\n", - "\n", - "Extraction complete! 
Total characters: 100016\n", + "Processing PDF with 92 pages...\n", + "Processed page 1/92\n", + "Processed page 2/92\n", + "Processed page 3/92\n", + "Processed page 4/92\n", + "Processed page 5/92\n", + "Processed page 6/92\n", + "Processed page 7/92\n", + "Processed page 8/92\n", + "Processed page 9/92\n", + "Processed page 10/92\n", + "Processed page 11/92\n", + "Processed page 12/92\n", + "Processed page 13/92\n", + "Processed page 14/92\n", + "Processed page 15/92\n", + "Processed page 16/92\n", + "Processed page 17/92\n", + "Processed page 18/92\n", + "Processed page 19/92\n", + "Processed page 20/92\n", + "Processed page 21/92\n", + "Processed page 22/92\n", + "Processed page 23/92\n", + "Processed page 24/92\n", + "Processed page 25/92\n", + "Processed page 26/92\n", + "Reached 100000 character limit at page 27\n", + "\n", + "Extraction complete! Total characters: 100026\n", "\n", "Preview of extracted text (first 500 characters):\n", "--------------------------------------------------\n", - "1\n", - "A Survey on Knowledge Distillation of Large\n", - "Language Models\n", - "Xiaohan Xu1, Ming Li2, Chongyang Tao3, Tao Shen4, Reynold Cheng1, Jinyang Li1,\n", - "Can Xu5, Dacheng Tao6, Tianyi Zhou2\n", - "1The University of Hong Kong2University of Maryland3Microsoft\n", - "4University of Technology Sydney5Peking University6The University of Sydney\n", - "{shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu\n", - "ckcheng@cs.hku.hk jl0725@connect.hku.hk\n", - "Abstract —In the era of Large Language Models (LLMs), Knowledge Distillati\n", + "The Llama 3 Herd of Models\n", + "Llama Team, AI @ Meta1\n", + "1A detailed contributor list can be found in the appendix of this paper.\n", + "Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a\n", + "new set of foundation models, called Llama 3. It is a herd of language models that natively support\n", + "multilinguality, coding, reasoning, and tool usage. 
Our largest model is a dense Transformer with\n", + "405B parameters and a context window of up to 128K tokens. This paper presents \n", "--------------------------------------------------\n", "\n", - "Total characters extracted: 100016\n", + "Total characters extracted: 100026\n", "\n", - "Extracted text has been saved to extracted_text.txt\n" + "Extracted text has been saved to ./resources/extracted_text.txt\n" ] } ], @@ -305,7 +315,7 @@ "\n", "# Optional: Save the extracted text to a file\n", "if extracted_text:\n", - " output_file = 'extracted_text.txt'\n", + " output_file = './resources/extracted_text.txt'\n", " with open(output_file, 'w', encoding='utf-8') as f:\n", " f.write(extracted_text)\n", " print(f\"\\nExtracted text has been saved to {output_file}\")" @@ -329,12 +339,12 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": 8, "id": "7c0828a5-964d-475e-b5f5-40a04e287725", "metadata": {}, "outputs": [], "source": [ - "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", + "#device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "\n", "SYS_PROMPT = \"\"\"\n", "You are a world class text pre-processor, here is the raw data from a PDF, please parse and return it in a way that is crispy and usable to send to a podcast writer.\n", @@ -368,7 +378,7 @@ }, { "cell_type": "code", - "execution_count": 61, + "execution_count": 9, "id": "24e8a547-9d7c-4e2f-be9e-a3aea09cce76", "metadata": {}, "outputs": [], @@ -410,51 +420,51 @@ }, { "cell_type": "code", - "execution_count": 62, + "execution_count": 44, "id": "d04a4f07-b0b3-45ca-8f41-a433e1abe050", "metadata": {}, "outputs": [], "source": [ - "accelerator = Accelerator()\n", - "model = AutoModelForCausalLM.from_pretrained(\n", - " DEFAULT_MODEL,\n", - " torch_dtype=torch.bfloat16,\n", - " use_safetensors=True,\n", - " device_map=device,\n", - ")\n", - "tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL, use_safetensors=True)\n", - "model, tokenizer = 
accelerator.prepare(model, tokenizer)" + "# accelerator = Accelerator()\n", + "# model = AutoModelForCausalLM.from_pretrained(\n", + "# DEFAULT_MODEL,\n", + "# torch_dtype=torch.bfloat16,\n", + "# use_safetensors=True,\n", + "# device_map=device,\n", + "# )\n", + "# tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL, use_safetensors=True)\n", + "# model, tokenizer = accelerator.prepare(model, tokenizer)\n", + "\n" ] }, { "cell_type": "code", - "execution_count": 63, + "execution_count": 10, "id": "bbda5241-e890-4402-87dd-514d6761bb9c", "metadata": {}, "outputs": [], "source": [ "def process_chunk(text_chunk, chunk_num):\n", - " \"\"\"Process a chunk of text and return both input and output for verification\"\"\"\n", - " conversation = [\n", - " {\"role\": \"system\", \"content\": SYS_PROMPT},\n", - " {\"role\": \"user\", \"content\": text_chunk},\n", - " ]\n", - " \n", - " prompt = tokenizer.apply_chat_template(conversation, tokenize=False)\n", - " inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n", - " \n", - " with torch.no_grad():\n", - " output = model.generate(\n", - " **inputs,\n", - " temperature=0.7,\n", - " top_p=0.9,\n", - " max_new_tokens=512\n", - " )\n", + " \"\"\"Process a chunk of text using Llama API and return output\"\"\"\n", + " response = client.chat.completions.create(\n", + " model=\"Llama-3.3-8B-Instruct\", # Use the appropriate model ID\n", + " messages=[\n", + " {\n", + " \"role\": \"system\", \n", + " \"content\": SYS_PROMPT\n", + " },\n", + " {\n", + " \"role\": \"user\", \n", + " \"content\": text_chunk\n", + " }\n", + " ],\n", + " max_completion_tokens=512,\n", + " temperature=0.7,\n", + " )\n", " \n", - " processed_text = tokenizer.decode(output[0], skip_special_tokens=True)[len(prompt):].strip()\n", + " processed_text = response.completion_message.content.text\n", " \n", " # Print chunk information for monitoring\n", - " #print(f\"\\n{'='*40} Chunk {chunk_num} {'='*40}\")\n", " print(f\"INPUT 
TEXT:\\n{text_chunk[:500]}...\") # Show first 500 chars of input\n", " print(f\"\\nPROCESSED TEXT:\\n{processed_text[:500]}...\") # Show first 500 chars of output\n", " print(f\"{'='*90}\\n\")\n", @@ -464,21 +474,26 @@ }, { "cell_type": "code", - "execution_count": 64, - "id": "a0183c47-339d-4041-ae83-77fc34931075", + "execution_count": 11, + "id": "5aadf30f", "metadata": {}, "outputs": [], "source": [ + "# The chunking step below expects `text`, so load it before creating chunks\n", + "# Load the extracted text from file\n", + "with open('./resources/extracted_text.txt', 'r', encoding='utf-8') as f:\n", + " text = f.read()\n", + "\n", "INPUT_FILE = \"./resources/extracted_text.txt\" # Replace with your file path\n", "CHUNK_SIZE = 1000 # Adjust chunk size if needed\n", "\n", "chunks = create_word_bounded_chunks(text, CHUNK_SIZE)\n", - "num_chunks = len(chunks)\n" + "num_chunks = len(chunks)" ] }, { "cell_type": "code", - "execution_count": 65, + "execution_count": 12, "id": "bb36814f-9310-4734-bf54-e16a5032339e", "metadata": {}, "outputs": [ @@ -488,7 +503,7 @@ "101" ] }, - "execution_count": 65, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } @@ -499,7 +514,7 @@ }, { "cell_type": "code", - "execution_count": 66, + "execution_count": 13, "id": "447188d3-ebf0-42d5-940e-4d7e0d9dbf32", "metadata": {}, "outputs": [], @@ -513,19 +528,20 @@ "\n", "# Cell 6: Process the file with ordered output\n", "# Create output file name\n", - "output_file = f\"clean_{os.path.basename(INPUT_FILE)}\"" + "input_dir = os.path.dirname(INPUT_FILE)\n", + "output_file = os.path.join(input_dir, f\"clean_{os.path.basename(INPUT_FILE)}\")\n" ] }, { "cell_type": "code", - "execution_count": 67, + "execution_count": 14, "id": "7917dfdd-b3af-44fc-a8c0-2760ace9363e", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "b767f45b5e514e7db936cef825af6fce", + "model_id": "67a9ffc357cf434f9583be2151572655", "version_major": 2, "version_minor": 0 }, @@
-536,2092 +552,166 @@ "metadata": {}, "output_type": "display_data" }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n", - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, { "name": "stdout", "output_type": "stream", "text": [ "INPUT TEXT:\n", - "1 A Survey on Knowledge Distillation of Large Language Models Xiaohan Xu1, Ming Li2, Chongyang Tao3, Tao Shen4, Reynold Cheng1, Jinyang Li1, Can Xu5, Dacheng Tao6, Tianyi Zhou2 1The University of Hong Kong2University of Maryland3Microsoft 4University of Technology Sydney5Peking University6The University of Sydney {shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu ckcheng@cs.hku.hk jl0725@connect.hku.hk Abstract —In the era of Large Language Models (LLMs), Knowledge Distillati...\n", + "The Llama 3 Herd of Models Llama Team, AI @ Meta1 1A detailed contributor list can be found in the appendix of this paper. Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents ...\n", "\n", "PROCESSED TEXT:\n", - "===============\n", - "\n", - "Knowledge Distillation is a methodology that transfers advanced capabilities from leading proprietary Large Language Models (LLMs) to their open-source counterparts, such as LLaMA and Mistral. 
This paper presents a comprehensive survey of KD's role in imparting advanced knowledge.\n", - "\n", - "Abstract —In the era of Large Language Models, Knowledge Distillation emerges as a pivotal methodology for transferring advanced capabilities from proprietary LLMs to open-source counterparts, facilit...\n", + "Modern artificial intelligence systems are powered by foundation models. This paper presents a new set of foundation models called Llama 3. It is a herd of language models that support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. Llama 3 delivers comparable quality to leading language models on a plethora of tasks. We publicly release Llama 3 including pre-trained and post-trained vers...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "advanced knowledge to smaller models and its utility in model compression and self- improvement. Our survey is meticulously structured around three foundational pillars: algorithm ,skill, and verticalization – providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a p...\n", + "this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development. 
Date:July 23, 2024 Website: https://llama.meta.com/ 1 Introduction Foundation models are general models of language, vision, speech, and/or other modalities that are designed to support a large variety of AI tasks. They form the basis of many modern AI systems. The development of modern foundatio...\n", "\n", "PROCESSED TEXT:\n", - "xamined through a meticulous survey that delves into the foundational pillars of algorithm, skill, and verticalization, which form the backbone of knowledge distillation and deep learning models. The survey provides a comprehensive examination of key mechanisms within the knowledge distillation framework, specifically focusing on the enhancement of cognitive abilities and their practical implications across various fields, with a particular emphasis on the interplay between data augmentation (DA...\n", + "this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development. \n", + "Foundation models are general models of language, vision, speech, and/or other modalities that are designed to support a large variety of AI tasks. They form the basis of many modern AI systems. \n", + "The development of modern foundation models consists of two main stages: a pre-training stage and a ...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "distillation and proposing future research directions. By bridging the gap between proprietary and open-source LLMs, this survey underscores the potential for more accessible, efficient, and powerful AI solutions. 
Most importantly, we firmly advocate for compliance with the legal terms that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated Github repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs. Index Terms —...\n", + "multilinguality, coding, reasoning, and tool usage. Our largest model is dense Transformer with 405B parameters, processing information in a context window of up to 128K tokens. Each member of the herd is listed in Table 1. All the results presented in this paper are for the Llama 3.1 models, which we will refer to as Llama 3 throughout for brevity. We believe there are three key levers in the development of high-quality foundation models: data, scale, and managing complexity. We seek to optimiz...\n", "\n", "PROCESSED TEXT:\n", - "en-source LLMs, this survey highlights the potential for more accessible, efficient, and powerful AI solutions.\n", - "\n", - "Most importantly, we advocate for compliance with legal terms that regulate the use of LLMs, ensuring ethical and lawful application of knowledge distillation.\n", - "\n", - "An associated Github repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs. Index Terms - Large language models, knowledge distillation, data augmentation, skill distillation, supervised f...\n", + "Our largest model is a dense Transformer with 405B parameters, processing information in a context window of up to 128K tokens. We believe there are three key levers in the development of high-quality foundation models: data, scale, and managing complexity. We seek to optimize for these three levers in our development process. \n", + "We improved the quantity and quality of the data we use for pre-training and post-training compared to prior versions of Llama. 
This includes more careful pre-processing ...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "complexity, have un- locked new realms of possibility, from generating human- like text to offering sophisticated problem-solving capa- bilities. The core significance of these LLMs lies in their emergent abilities (Wei et al., 2022a,b; Xu et al., 2024a), a phenomenon where the models display capabilities beyond their explicit training objectives, enabling them to tackle a diverse array of tasks with remarkable proficiency. Their deep understanding of context, nuance, and the intrica- cies of hu...\n", + "compared to 1.8T tokens for Llama 2. •Scale.We train a model at far larger scale than previous Llama models: our flagship language model was pre-trained using 3.8×1025FLOPs, almost 50×more than the largest version of Llama 2. Specifically, we pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens. As expected per 1arXiv:2407.21783v3 [cs.AI] 23 Nov 2024 Finetuned Multilingual Long context Tool use Release Llama 3 8B ✗ ✗1✗ ✗ April 2024 Llama 3 8B Instruct ✓ ✗ ✗ ✗ April 20...\n", "\n", "PROCESSED TEXT:\n", - "sophisticated problem-solving capabilities, the core significance of these large language models (LLMs) lies in their emergent abilities, enabling them to tackle a diverse array of tasks with remarkable proficiency....\n", + "We train a model at a larger scale than previous Llama models. Our flagship language model was pre-trained using a large amount of computational power and 405B trainable parameters on 15.6T text tokens. This is significantly larger than the largest version of Llama 2. 
All results in this paper are for the Llama 3.1 models....\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "applications, promising to revolutionize industries, augment human creativity, and redefine our interaction with technology. Despite the remarkable capabilities of proprietary LLMs like GPT-4 and Gemini, they are not without their shortcom- ings, particularly when viewed in light of the advantages offered by open-source models. A significant drawback is their limited accessibility and higher cost (OpenAI et al., 2023). These proprietary models often come with substantial usage fees and restricte...\n", + "same procedure. While our scaling laws suggest our flagship model is an approximately compute-optimal size for our training budget, we also train our smaller models for much longer than is compute-optimal. The resulting models perform better than compute-optimal models at the same inference budget. We use the flagship model to further improve the quality of those smaller models during post-training. •Managing complexity. We make design choices that seek to maximize our ability to scale the model...\n", "\n", "PROCESSED TEXT:\n", - "their remarkable capabilities, have some notable limitations, particularly when considering the advantages offered by open-source models, such as GPT-4 and Gemini. These models are often expensive, with substantial usage fees and restricted access, making them inaccessible to individuals and smaller organizations....\n", + "We scale our model to an approximately compute-optimal size for our training budget. We also train smaller models for longer than is compute-optimal. 
These models perform better than compute-optimal models at the same inference budget. We use the flagship model to improve the quality of smaller models after training. \n", + "\n", + "We make design choices to maximize our ability to scale the model development process. We use a standard dense Transformer model architecture with minor adaptations. We adopt a si...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "applica- tions. The constraints of accessibility, cost, and adaptability thus present significant challenges in leveraging the full potential of proprietary LLMs. In contrast to proprietary LLMs, open-source modelsarXiv:2402.13116v3 [cs.CL] 8 Mar 2024 2 like LLaMA (Touvron et al., 2023) and Mistral (Jiang et al., 2023a) bring several notable advantages. One of the primary benefits of open-source models is their accessibility and adaptability. Without the constraints of licensing fees or restrict...\n", + "et al., 2022; Schulman et al., 2017) that tend to be less stable and harder to scale. The result of our work is Llama 3: a herd of three multilingual1language models with 8B, 70B, and 405B parameters. We evaluate the performance of Llama 3 on a plethora of benchmark datasets that span a wide range of language understanding tasks. In addition, we perform extensive human evaluations that compare Llama 3 with competing models. An overview of the performance of the flagship Llama 3 model on key benc...\n", "\n", "PROCESSED TEXT:\n", - "ng restrictions and costs. In contrast, open-source LLMs like LLaMA and Mistral bring several advantages. 
Accessibility and adaptability are key benefits, as they are more readily available to a broader range of users, including researchers and organizations....\n", + "Our work results in Llama 3, a herd of three multilingual language models with 8B, 70B, and 405B parameters. We evaluate Llama 3 on various benchmark datasets and perform human evaluations comparing it to competing models. The flagship model performs on par with leading language models across different tasks and matches the state-of-the-art. Smaller models outperform alternatives with similar parameters. Llama 3 also offers a better balance between being helpful and harmless than its predecessor...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "of drawbacks, primarily stemming from their relatively limited scale and resources compared to their proprietary counterparts. One of the most significant limitations is the smaller model scale, which often results in lower per- formance on real-world tasks with a bunch of instruc- tions (Zheng et al., 2023a). These models, with fewer pa- rameters, may struggle to capture the depth and breadth of knowledge embodied in larger models like GPT-4. Ad- ditionally, the pre-training investment in these...\n", + "et al., 2023b). We present a detailed analysis of the safety of Llama 3 in Section 5.4. We are publicly releasing all three Llama 3 models under an updated version of the Llama 3 Community License; seehttps://llama.meta.com . This includes pre-trained and post-trained versions of our 405B parameter language model and a new version of our Llama Guard model (Inan et al., 2023) for input and output safety. 
We hope that the open release of a flagship model will spur a wave of innovation in the resea...\n", "\n", "PROCESSED TEXT:\n", - "ts. One of the most significant limitations is the smaller model scale, resulting in lower performance on real-world tasks with multiple instructions (Zheng et al., 2023a). Models with fewer parameters struggle to capture the depth and breadth of knowledge embodied in larger models like GPT-4. Additionally, the pre-training investment in these open-source models is typically less substantial. This reduced investment can lead to a narrower range of pre-training data, potentially limiting their un...\n", + "We present a detailed analysis of the safety of Llama 3 in Section 5.4. We are releasing all three Llama 3 models under an updated version of the Llama 3 Community License. This includes pre-trained and post-trained versions of our 405B parameter language model and a new version of our Llama Guard model for input and output safety. The open release of the model is hoped to spur innovation in the research community and accelerate a responsible path towards artificial general intelligence. As part...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "effectiveness in specialized applications. This limitation becomes particularly evident when these models are compared to the highly fine-tuned proprietary LLMs, which are often tailored to excel in a wide array of complex scenarios (OpenAI et al., 2023). Primarily, recognizing the disparities between propri- etary and open-source LLMs, KD techniques have surged as a means to bridge the performance gap between these models (Gou et al., 2021; Gupta and Agrawal, 2022). 
Knowl- edge distillation, in...\n", + "models. 1The Llama 3 8B and 70B were pre-trained on multilingual data but were intended for use in English at the time. 2 Category Benchmark Llama 3 8B Gemma 2 9B Mistral 7B Llama 3 70B Mixtral 8x22B GPT 3.5 Turbo Llama 3 405B Nemotron 4 340B GPT-4 (0125) GPT-4o Claude 3.5 Sonnet GeneralMMLU (5-shot) 69.4 72.361.1 83.676.970.787.382.6 85.189.1 89.9 MMLU (0-shot, CoT) 73.0 72.3△60.5 86.079.969.888.6 78.7◁85.4 88.7 88.3 MMLU-Pro (5-shot, CoT) 48.3 –36.9 66.456.349.273.362.7 64.874.0 77.0 IFEval 80...\n", "\n", "PROCESSED TEXT:\n", - "ary models becomes apparent when compared to highly fine-tuned proprietary LLMs. Primarily, the disparity between proprietary and open-source LLMs becomes evident, with proprietary models excelling in complex scenarios, while open-source models excel in a wide range of scenarios. Knowledge distillation, a technique that leverages the advanced capabilities of proprietary models, is used to enhance the competencies of open-source models. This process is similar to transferring the performance of a...\n", + "Llama 3 8B and Llama 3 70B were pre-trained on multilingual data for use in English. \n", + "Category Benchmark results: \n", + "Llama 3 8B, Gemma 2 9B, Mistral 7B, Llama 3 70B, Mixtral 8x22B, GPT 3.5 Turbo, Llama 3 405B, Nemotron 4 340B, GPT-4, Claude 3.5, Sonnet, GeneralMMLU, MMLU (0-shot, CoT), MMLU-Pro (5-shot, CoT), IFEval, CodeHumanEval (0-shot), MBPP EvalPlus (0-shot), MathGSM8K, MATH (0-shot, CoT), ReasoningARC Challenge, GPQA (0-shot, CoT), Tool useBFCL. 
\n", + "Results: \n", + "69.4, 72.36, 11.1, 83.67, 6.97, 70....\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "augmentation (DA) (Feng et al., 2021) has emerged as a prevalent paradigm to achieve knowledge distillation of LLMs, where a small seed of knowledge is used to prompt the LLM to generate more data with respect to a specific skill or domain (Taori et al., 2023). Secondly, KD still retains its fundamental role in compressing LLMs, making them more efficient without significant loss in performance. (Gu et al., 2024; Agarwal et al., 2024). More recently, the strategy of employing open-source LLMs as...\n", + "76.1 –60.484.8– 85.988.586.5 88.380.5 90.2 Nexus 38.5 30.024.7 56.748.537.2 58.7 –50.356.1 45.7 Long contextZeroSCROLLS/QuALITY 81.0 ––90.5–– 95.2 – 95.2 90.5 90.5 InfiniteBench/En.MC 65.1 ––78.2–– 83.4 –72.182.5 – NIH/Multi-needle 98.8 ––97.5––98.1 – 100.0 100.0 90.8 Multilingual MGSM (0-shot, CoT) 68.9 53.229.9 86.971.151.4 91.6 –85.990.5 91.6 Table 2 Performance of finetuned Llama 3 models on key benchmark evaluations. The table compares the performance of the 8B, 70B, and 405B versions of Ll...\n", "\n", "PROCESSED TEXT:\n", - "tillation of LLMs, where a small seed of knowledge is used to prompt the LLM to generate more data with respect to a specific skill or domain (Taori et al., 2023). Furthermore, KD retains its fundamental role in compressing LLMs, making them more efficient without significant loss in performance....\n", + "The development of our Llama 3 language models comprises two main stages: \n", + "Language model pre-training. 
We start by converting a large, multilingual text corpus to \n", + "fine-tune the models on key benchmark evaluations. The table compares the performance of the 8B, 70B, and 405B versions of Llama 3 with that of competing models....\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "trend of self-improvement via self-generated knowledge. A key aspect of the knowledge distillation is the en- hancement of skills such as advanced context following (e.g., in-context learning (Huang et al., 2022a) and in- struction following (Taori et al., 2023)), improved align- ment with user intents (e.g., human values/principles (Cui et al., 2023a), and thinking patterns like chain-of-thought (CoT) (Mukherjee et al., 2023)), and NLP task specialization (e.g., semantic understanding (Ding et ...\n", + "discrete tokens and pre-training a large language model (LLM) on the resulting data to perform next-token prediction. In the language model pre-training stage, the model learns the structure of language and obtains large amounts of knowledge about the world from the text it is “reading”. To do this effectively, pre-training is performed at massive scale: we pre-train a model with 405B parameters on 15.6T tokens using a context window of 8K tokens. 
This standard pre-training stage is followed by ...\n", "\n", "PROCESSED TEXT:\n", - "advanced context following and instruction following**\n", - "\n", - "**key aspects of knowledge distillation**\n", - "\n", - "* **contextual understanding**: in-context learning and instruction following\n", - "* **alignment with user intents**: human values/principles and thinking patterns like chain-of-thought\n", - "* **NLP task specialization**: semantic understanding and code generation\n", - "\n", - "**critical skills for various applications**\n", - "\n", - "* **healthcare**: accuracy and contextual knowledge\n", - "* **law**: contextual knowledge and precision\n", - "*...\n", + "pre-training a large language model on text data enables it to learn language structure and gain knowledge about the world. This pre-training is done on a massive scale with a model having 405B parameters and 15.6T tokens. After the initial pre-training, the model undergoes continued pre-training to increase its context window. \n", + "The pre-trained model then requires post-training to align with human expectations and follow instructions. This involves supervised finetuning and Direct Preference Opt...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "performance by learning from the proprietary models that have been extensively trained and fine-tuned in these areas. The benefits of knowledge distillation in the era of LLMs are multifaceted and transformative (Gu et al., 2024). Through a suite of distillation techniques, the gap between proprietary and open-source models is significantly nar- rowed (Chiang et al., 2023; Xu et al., 2023a) and even filled (Zhao et al., 2023a). 
This process not only streamlines computational requirements but als...\n", + "al., 2024). At this post-training2stage, we also integrate new capabilities, such as tool-use, and observe strong improvements in other areas, such as coding and reasoning. See Section 4 for details. Finally, safety mitigations are also incorporated into the model at the post-training stage, the details of which are described in Section 5.4. The resulting models have a rich set of capabilities. They can answer questions in at least eight languages, write high-quality code, solve complex reasonin...\n", "\n", "PROCESSED TEXT:\n", - "ned in the era of LLMs, the benefits of knowledge distillation in the era of LLMs are multifaceted and transformative. Through a suite of distillation techniques, the gap between proprietary and open-source models narrows and is filled. This process streamlines computational requirements and enhances environmental sustainability of AI operations, as open-source models become more proficient with lower overhead....\n", + "We integrate new capabilities such as tool-use at the post-training stage, resulting in strong improvements in areas like coding and reasoning. The models can answer questions in multiple languages, write high-quality code, solve complex problems, and use tools. We also add image, video, and speech capabilities to Llama 3 using a compositional approach. This approach includes training separate encoders for images and speech, teaching the model the relationship between visual content and text....\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "catalyzing innovation and growth across various industries and research domains. 
The escalating need for a comprehensive survey on the knowledge distillation of LLMs stems from the rapidly evolving landscape of AI (OpenAI et al., 2023; Team et al., 2023) and the increasing complexity of these models. As AI continues to penetrate various sectors, the ability to effi- ciently and effectively distill knowledge from proprietary LLMs to open-source ones becomes not just a technical aspiration but a p...\n", + "description of that content in natural language. Our speech encoder is trained using a 2In this paper, we use the term “post-training” to refer to any model training that happens outside of pre-training. 3 Figure 1 Illustration of the overall architecture and training of Llama 3. Llama 3 is a Transformer language model trained to predict the next token of a textual sequence. See text for details. self-supervised approach that masks out parts of the speech inputs and tries to reconstruct the mask...\n", "\n", "PROCESSED TEXT:\n", - "ch domains. The escalating need for a comprehensive survey on the knowledge distillation of LLMs stems from the rapidly evolving landscape of AI and the increasing complexity of these models. The ability to efficiently and effectively distill knowledge from proprietary LLMs to open-source ones becomes a practical necessity. This is driven by the need to bridge the knowledge gap between the proprietary and open-source LLMs.\n", - "\n", - "This need is driven by the 3 models mentioned, including Student, Vicuna...\n", + "We use the term post-training to refer to any model training that happens outside of pre-training. Llama 3 is a Transformer language model trained to predict the next token of a textual sequence. The model learns the structure of speech signals using a self-supervised approach that masks out parts of the speech inputs and tries to reconstruct the masked out parts. 
We also train an adapter that integrates a pre-trained image encoder into the pre-trained language model....\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "SupervisedFine-tuningX,Y preferenceRankOptimizationy,1y,2y3y1y2y3≻≻rank…… DataCuration X,YrawdatasynthesizefeedbackFeedback input outputSelf-Knowledge outputinputinput YlabelLabelingExpansion X,YdemonstrationsexpandFeature featureinput,outputextractSec.4Sec.5 Sec.3.1Sec.3.2 Fig. 2: An overview of this survey on knowledge distillation of large language models. Note that ‘Section’ is abbreviated as ‘Sec.’ in this figure. RM S(·)denotes the student reward model. the growing demand for more accessib...\n", + "pairs. This aligns the image representations with the language representations. During adapter training, we also update the parameters of the image encoder but we intentionally do not update the language-model parameters. We also train a video adapter on top of the image adapter on paired video-text data. This enables the model to aggregate information across frames. See Section 7 for details. •Speech adapter training. Finally, we integrate the speech encoder into the model via an adapter that c...\n", "\n", "PROCESSED TEXT:\n", - "synthesizefeedbackFeedback input outputSelf-Knowledge outputinputinput YlabelLabelingExpansion X,Y demonstrationsexpandFeature featureinput,outputextractSec.4Sec.5 Sec.3.1Sec.3.2 Fig. 2: An overview of this survey on knowledge distillation of large language models...\n", + "This aligns the image representations with the language representations. 
During adapter training, we update the parameters of the image encoder and train a video adapter on paired video-text data to aggregate information across frames. We also integrate the speech encoder into the model via an adapter that converts speech encodings into token representations for the finetuned language model. The parameters of the adapter and encoder are jointly updated for high-quality speech understanding. Our ...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "gaps in current techniques and proposing direc- tions for future research. Survey Organization. The remainder of this survey is orga- nized into several comprehensive sections, each designed to offer a deep dive into the multifaceted aspects of knowledge distillation within the realm ofLLMs. Following this intro- duction, §2 provides a foundational overview of knowledge distillation, comparing traditional techniques with those emerging in the era of LLMs and highlighting the role of data augment...\n", + "interaction via a speech interface. These models are still under development and not yet ready for release. 3 Pre-Training Language model pre-training involves: (1)the curation and filtering of a large-scale training corpus, (2)the development of a model architecture and corresponding scaling laws for determining model size, (3)the development of techniques for efficient pre-training at large scale, and (4)the development of a pre-training recipe. We present each of these components separately b...\n", "\n", "PROCESSED TEXT:\n", - "es emerging, but there is still much to be learned from the era of Large Language Models (LLMs). 
In this section, we provide a foundational overview of knowledge distillation, highlighting the role of data augmentation (DA) in this context.\n", + "We create a dataset for language model pre-training from various data sources up to 2023. We apply de-duplication methods and data cleaning to obtain high-quality tokens. We remove data containing personally identifiable information and adult content. \n", "\n", - "Traditional techniques, such as supervised fine-tuning, have shown promise in distilling knowledge from LLMs. However, the increasing complexity of these models requires careful consideration of the trade-offs between accuracy and computational resources. To...\n", + "We obtain much of our data from the web and describe...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "includes discus- sions on natural language understanding (NLU), genera- tion (NLG), information retrieval, recommendation systems, and the evaluation of text generation. In §5, we ventureinto domain-specific vertical distillation, showcasing how knowledge distillation techniques are applied within spe- cialized fields such as law, healthcare, finance, and science, illustrating the practical implications and transformative impact of these approaches. The survey suggests open problems in §6, ident...\n", + "our cleaning process below. PII and safety filtering. Among other mitigations, we implement filters designed to remove data from websites are likely to contain unsafe content or high volumes of PII, domains that have been ranked as harmful according to a variety of Meta safety standards, and domains that are known to contain adult content. 4 Text extraction and cleaning. 
We process the raw HTML content for non-truncated web documents to extract high-quality diverse text. To do so, we build a cus...\n", "\n", "PROCESSED TEXT:\n", - "mmendation systems, and the evaluation of text generation. In §5, we delve into domain-specific vertical distillation, demonstrating how knowledge distillation techniques are applied in specialized fields such as law, healthcare, finance, and science, highlighting their practical implications and transformative impact. The survey reveals open problems in §6, highlighting current challenges and gaps in knowledge distillation research that present opportunities for future work....\n", + "Our cleaning process involves PII and safety filtering. We remove data from websites that may contain unsafe content or high volumes of personally identifiable information. We also filter out domains ranked as harmful by Meta's safety standards and those known to contain adult content. \n", + "\n", + "Text extraction and cleaning are key steps. We process raw HTML content to extract high-quality, diverse text. Our custom parser is designed to optimize for precision in removing boilerplate and ensuring content...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "process of transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student) (Gou et al., 2021). This technique is pivotal in mitigating the challenges posed by the computational demands and resource constraints of deploying large-scale models in practical applications. 
Historically, knowledge distillation techniques, prior to the era of LLMs, primarily concentrated on transferring knowledge from complex, often cumbersome neural net- works to more compact ...\n", + "pre-rendered images where the math is also provided in the altattribute. We experimentally evaluate different cleaning configurations. We find markdown is harmful to the performance of a model that is primarily trained on web data compared to plain text, so we remove all markdown markers. De-duplication. We apply several rounds of de-duplication at the URL, document, and line level: •URL-level de-duplication. We perform URL-level de-duplication across the entire dataset. We keep the most recent ...\n", "\n", "PROCESSED TEXT:\n", - "large, complex model to a smaller, more efficient model, mitigating the challenges of computational demands and resource constraints in deploying large-scale models in practical applications. This process, prior to the era of Large Language Models (LLMs), focused on compacting complex neural networks for deployment in resource-constrained environments, such as mobile devices or edge computing platforms, where computational efficiency was paramount....\n", + "We experimentally evaluate different cleaning configurations. We find that markdown is harmful to the performance of a model trained on web data compared to plain text, so we remove all markdown markers. \n", + "We apply several rounds of de-duplication at the URL, document, and line level. \n", + "We keep the most recent version for pages corresponding to each URL. \n", + "We perform global de-duplication across the entire dataset to remove near duplicate documents. 
\n", + "We remove lines that appeared more than 6 times ...\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "Mammoth (Yue et al., 2023a), Mixed Distill (Chenglin et al., 2023) ExpansionSelf-Instruct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Code Alpaca (Chaudhary, 2023) Self-Align (Sun et al., 2024b), WizardLM (Xu et al., 2023a), WizardCoder (Luo et al., 2023a), WizardMath (Luo et al., 2023b), AugGPT (Dai et al., 2023a), TDG (He et al., 2023b) CurationUltraChat (Ding et al., 2023b), Phi-1 (Gunasekar et al., 2023), Phi-1.5 (Li et al., 2023a), Phi-2 (Mar, 2023), Magicoder (Wei et al., 2023), Wav...\n", + "boilerplate from various websites such as navigation menus, cookie warnings, but also frequent high-quality text, our empirical evaluations showed strong improvements. Heuristic filtering. We develop heuristics to remove additional low-quality documents, outliers, and documents with excessive repetitions. Some examples of heuristics include: •We use duplicated n-gram coverage ratio (Rae et al., 2021) to remove lines that consist of repeated content such as logging or error messages. 
Those lines ...\n", "\n", "PROCESSED TEXT:\n", - "al., 2022a), Alpaca (Taori et al., 2023), Code Alpaca (Chaudhary, 2023) Self-Align (Sun et al., 2024b), WizardLM (Xu et al., 2023a), WizardCoder (Luo et al., 2023a), WizardMath (Luo et al., 2023b), AugGPT (Dai et al., 2023a), TDG (He et al., 2023b), CurationUltraChat (Ding et al., 2023b), Phi-1 (Gunasekar et al., 2023), Phi-1.5 (Li et al., 2023a), Phi-2 (Mar, 2023), Magicoder (Wei et al., 2023), WaveCoder (Yu et al., 2024), ZeroGen (Ye et al., 2022), InPars (Bonifacio et al., 2022)...\n", + "We develop heuristics to remove low-quality documents and outliers. Some examples include using duplicated n-gram coverage ratio to remove repeated content and \"dirty word\" counting to filter out adult websites. We also use a token-distribution Kullback-Leibler divergence to filter out documents with excessive outlier tokens. Further, we experiment with model-based quality classifiers to sub-select documents....\n", "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "\n", "INPUT TEXT:\n", - "(Chen et al., 2023a), GKD (Agarwal et al., 2024) Self-KnowledgeSelf-Instruct (Wang et al., 2022a), Self-Align (Sun et al., 2024b), RLCD (Yang et al., 2024a), ImpDistill (Jung et al., 2023), LMSI (Huang et al., 2023a), ReST (Gulcehre et al., 2023), Self-Rewarding (Yuan et al., 2024a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022) DistillationSupervised Fine-TuningAlpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a), Self-Instruct (Wang et al., 2022a), Ba...\n", + "high-quality tokens. 
These include using fast classifiers such as fasttext (Joulin et al., 2017) trained to recognize if a given text would be referenced by Wikipedia (Touvron et al., 2023a), as well as more compute-intensive Roberta-based classifiers (Liu et al., 2019a) trained on Llama 2 predictions. To train a quality classifier based on Llama 2, we create a training set of cleaned web documents, describe the quality requirements, and instruct Llama 2’s chat model to determine if the document...\n", "\n", "PROCESSED TEXT:\n", - "Self-Align (Sun et al., 2024b), RLCD (Yang et al., 2024a), ImpDistill (Jung et al., 2023), LMSI (Huang et al., 2023a), ReST (Gulcehre et al., 2023), Self-Rewarding (Yuan et al., 2024a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022) DistillationSupervised Fine-TuningAlpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a), Self-Instruct (Wang et al., 2022a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022), Divergence and SimilarityDistilGPT (Sanh et a...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "al., 2023), CycleAlign (Hong et al., 2023), Skill DistillationContext FollowingInstruction FollowingSelf-Instruct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a), Orca (Mukherjee et al., 2023), Orca 2 (Mitra et al., 2023), WizardMath (Luo et al., 2023b), Llama-GPT4 (Peng et al., 2023a), Multi-turn DialogueVicuna (Chiang et al., 2023), Baize (Xu et al., 2023b), UltraLLaMA (Ding et al., 2023b), CAMEL (Li et al., 2023b), OpenChat (Wang et...\n", - "\n", - "PROCESSED TEXT:\n", - "ollowingInstruction FollowingSelf-Instruct Wang et al., 2022a, Alpaca Taori et al., 
2023, Vicuna Chiang et al., 2023, WizardLM Xu et al., 2023a, Orca Mukherjee et al., 2023, Orca2 Mitra et al., 2023, WizardMath Luo et al., 2023b, Llama-GPT4 Peng et al., 2023a, Multi-turn Dialogue Chiang et al., 2023, Baize Xu et al., 2023b, UltraLLaMA Ding et al., 2023b, CAMEL Li et al., 2023b, OpenChat Wang et al., 2023c, Zephyr Tunstall et al., 2023, RAG Kang et al., 2023a, SAIL Luo et al., 2023c, Self-RAG Asa...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "(Lee et al., 2023a), Zephy (Tunstall et al., 2023), UltraFeedback (Cui et al., 2023a), ValueCAI (Bai et al., 2022a), Align Honesty (Yang et al., 2023a), SANDBOX (Liu et al., 2023b), Self-Align (Sun et al., 2024b), UltraFeedback (Cui et al., 2023a), RLCD (Yang et al., 2024a) AgentTool UsingToolformer (Schick et al., 2023), Graph-ToolFormer (Zhang, 2023), Gorilla (Patil et al., 2023), ToolAlpaca (Tang et al., 2023a), ToolLLM (Qin et al., 2023a), CRAFT (Yuan et al., 2023a), Confucius (Gao et al., 2...\n", - "\n", - "PROCESSED TEXT:\n", - "i et al., 2022a), Align Honesty (Yang et al., 2023a), SANDBOX (Liu et al., 2023b), Self-Align (Sun et al., 2024b), UltraFeedback (Cui et al., 2023a), RLCD (Yang et al., 2024a), AgentToolformer (Schick et al., 2023), Graph-ToolFormer (Zhang, 2023), Gorilla (Patil et al., 2023), ToolAlpaca (Tang et al., 2023a), ToolLLM (Qin et al., 2023a), CRAFT (Yuan et al., 2023a), Confucius (Gao et al., 2023b), MLLM-Tool (Wang et al., 2024), α-UMi (Shen et al., 2024), PlanningFireAct (Chen et al., 2023b), Agent...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ 
- "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "2022), NLGInheritSumm (Xu et al., 2023c), RECOMP (Xu et al., 2024b), MaRio (Ramnath et al., 2023), ID (Jung et al., 2023), GPT-3 Labeling (Wang et al., 2021b), BioGPT (Guo et al., 2023a), ChatGPT NMT (Yang and Nicolai, 2023), Information RetrievalQUILL (Srinivasan et al., 2022), Promptgator (Dai et al., 2023b), InPars (Bonifacio et al., 2022), AugTriever (Meng et al., 2023), (Sun et al., 2023a), RankVicuna (Pradeep et al., 2023a), RankZephyr (Pradeep et al., 2023b), ExaRanker (Ferraretto et al.,...\n", - "\n", - "PROCESSED TEXT:\n", - "al., 2023 GPT-3 Labeling Wang et al., 2021b BioGPT Guo et al., 2023a ChatGPT NMT Yang and Nicolai, 2023 Information RetrievalQUILL Srinivasan et al., 2022 Promptgator Dai et al., 2023b InPars Bonifacio et al., 2022 AugTriever Meng et al., 2023 Sun et al., 2023a RankVicuna Pradeep et al., 2023a RankZephyr Pradeep et al., 2023b ExaRanker Ferraretto et al., 2023 Recommendation NDR Mysore et al., 2023 InstrcutRec Zhang et al., 2023b ONCE Liu et al., 2023c Text Generation Evaluation PandaLM Wang et a...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "al., 2024), Code Clean (Jain et al., 2023), Multi-ModalityLLaVA (Liu et al., 2023e), SVIT (Zhao et al., 2023b), LVIS-Instruct4V (Wang et al., 2023e), Shikra (Chen et al., 2023c), LSKD (Park et al., 2023), DetGPT (Pi et al., 2023; Zhao et al., 2023c), LRV (Liu et al., 2023f), NExT-GPT (Wu et al., 2023b), Valley (Luo et al., 2023d), ILuvUI (Jiang et al., 2023d), StableLLaVA (Li et al., 2023c), PointLLM (Xu et al., 2023e), 
Verticalization DistillationLaw (Huang et al., 2023b; Cui et al., 2023b); Me...\n", - "\n", - "PROCESSED TEXT:\n", - "et al., 2023e), SVIT (Zhao et al., 2023b), LVIS-Instruct4V (Wang et al., 2023e), Shikra (Chen et al., 2023c), LSKD (Park et al., 2023), DetGPT (Pi et al., 2023; Zhao et al., 2023c), LRV (Liu et al., 2023f), NExT-GPT (Wu et al., 2023b), Valley (Luo et al., 2023d), ILuvUI (Jiang et al., 2023d), StableLLaVA (Li et al., 2023c), PointLLM (Xu et al., 2023e), Verticalization DistillationLaw (Huang et al., 2023b; Cui et al., 2023b); Medical & Healthcare (Zhang et al., 2023c; Chen et al., 2023d); Finance...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "earlier methods involved training a smaller student network to mimic the output of a larger teacher network, often through techniques like soft target training, where the student learns from the softened softmax output of the teacher. Please refer to the survey (Gou et al., 2021) for more details on general knowledge distillation techniques in AI and DL. In contrast, the advent of LLMs has revolutionized the knowledge distillation landscape. The current era of knowledge distillation in LLMs shif...\n", - "\n", - "PROCESSED TEXT:\n", - "r network, often through techniques like soft target training, where the student learns from the softened softmax output of the teacher.\n", - "\n", - "The distillation of knowledge from larger models to smaller ones is a technique used to improve the performance of AI models. 
In this context, distillation refers to the process of distilling the knowledge from a larger model into a smaller model, allowing it to learn from the teacher model's output.\n", - "\n", - "The current era of knowledge distillation in large language...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "replicate the output behavior of the teacher model or reduce the model size , the current focus in LLM-based knowledge distillation is to extract and transfer the rich, nuanced understanding that these models have developed. The key to this modern approach lies in heuristic and carefully designed prompts, which are used to elicit specific knowledge (Ding et al., 2023b) or capabilities (Chaudhary, 2023) from the LLMs. 
These prompts are crafted to tap into the LLM’s understanding and capabilities ...\n", - "\n", - "PROCESSED TEXT:\n", - "size, the current focus in llm-based knowledge distillation is to extract and transfer the rich, nuanced understanding that these models have developed the key to this modern approach lies in carefully designed prompts that elicit specific knowledge or capabilities from the llms, tapping into their understanding and capabilities in various domains ranging from natural language understanding to more complex cognitive tasks like reasoning and problem-solving...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "of LLMs, where the models exhibit capabilities beyond their explicit training objectives. Furthermore, this era of knowledge distillation also em- phasizes the transfer of more abstract qualities such as reasoning patterns (Mitra et al., 2023), preference align- ment (Cui et al., 2023a), and value alignment (Sun et al., 2024b). This is in stark contrast to the earlier focus on output replication (Taori et al., 2023), indicating a shift towards a more holistic and comprehensive transfer of cognit...\n", - "\n", - "PROCESSED TEXT:\n", - "explicit training objectives. This era of knowledge distillation also emphasizes the transfer of abstract qualities such as reasoning patterns and preference alignment. This is in stark contrast to the earlier focus on output replication, indicating a shift towards a more holistic and comprehensive transfer of cognitive capabilities. The current techniques involve not just the replication of outputs, but also the emulation of thought processes and decision-making patterns of the teacher model. 
T...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "LLMs, Data Augmentation (DA) (Wang et al., 2022a; Ye et al., 2022) emerges as a critical paradigm integral to the process of knowledge distillation. Unlike traditional DA techniques such as paraphrasing (Gangal et al., 2022) orback-translation (Longpre et al., 2019), which primarily aim at expanding the training dataset in a somewhat mechanical manner. DA within the context of LLMs focuses on the generation of novel, context-rich training data tailored to specific domains and skills. This innova...\n", - "\n", - "PROCESSED TEXT:\n", - "llation, Unlike traditional techniques such as paraphrasing, or back-translation, which primarily aim at expanding the training dataset in a somewhat mechanical manner. DA within the context of LLMs focuses on the generation of novel, context-rich training data tailored to specific domains and skills. This innovation is driven by the unique capabilities of LLMs to generate coherent, diverse, and intricate data samples that closely mimic the nuanced understanding and cognitive abilities of human ...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "as a potent mechanism for bridging the knowl- edge and capability gap between proprietary and open- source models. 
Through DA, LLMs are prompted to create targeted, high-quality datasets that are not merely larger in volume but are also rich in diversity and specificity. This approach enables the distillation process to be more effec- tive, ensuring that the distilled models not only replicate the teacher model’s output behavior but also embody its deep-seated understanding and cognitive strateg...\n", - "\n", - "PROCESSED TEXT:\n", - "ource models, through Deep Learning Models (LLMs) are prompted to create targeted, high-quality datasets that are not merely larger in volume but also rich in diversity and specificity. This approach enables the distillation process to be more effective, ensuring that the distilled models replicate the teacher model's output behavior and embody its deep-seated understanding and cognitive strategies. The significance and necessity of Data Augmentation (DA) for achieving Knowledge Domains (KD) in ...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "pivotal shift towards a more efficient, sustainable, and accessible approach to harnessing the power of LLMs. It empowers open-source models with the ability to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts, thereby democratizing access to advanced AI capabilities and fostering innovation across a broader spectrum of applications and users. 
2.3 Survey Scope Building on the discussions introduced earlier, this ...\n", - "\n", - "PROCESSED TEXT:\n", - "er of LLMs empowers open-source models with the ability to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts thereby democratizing access to advanced AI capabilities and fostering innovation across a broader spectrum of applications and users 2 3 Survey Scope Building on the discussions introduced earlier this survey aims to comprehensively explore the landscape of knowledge distillation within the context of LLMs...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "distillation. KD Algorithms. This segment focuses on the technical foundations and methodologies of knowledge distillation. It includes an in-depth exploration of the processes involved in constructing knowledge from teacher models (e.g., pro- prietary LLMs) and integrating this knowledge into student models (e.g., open-source LLMs). Under the umbrella of ‘knowledge ’, we delve into strategies such as labeling (Hsieh et al., 2023), expansion (Taori et al., 2023), curation (Gu- nasekar et al., 20...\n", - "\n", - "PROCESSED TEXT:\n", - "undations and methodologies of knowledge distillation. It includes an in-depth exploration of processes involved in constructing knowledge from teacher models (e.g., proprietary LLMs) and integrating this knowledge into student models (e.g., open-source LLMs). Under the umbrella of 'knowledge', we delve into strategies such as labeling, expansion, curation, feature understanding, and feedback mechanisms. 
The exploration seeks to uncover the various ways in which knowledge can be identified, expa...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "et al., 2023a), and rank optimization strategies (Tunstall et al., 2023). This analysis aims to illuminate how these algorithms facilitate the trans- fer of knowledge, ensuring that open-source models can replicate and, in some cases, surpass the capabilities of their proprietary counterparts. Skill Distillation. This facet examines the specific compe- tencies and capabilities enhanced through KD. It encom- passes detailed discussions on context following (Taori et al., 2023; Luo et al., 2023c),...\n", - "\n", - "PROCESSED TEXT:\n", - "ow algorithms enable knowledge transfer, allowing open-source models to replicate and sometimes surpass proprietary capabilities. Skill Distillation examines specific competencies and capabilities enhanced through Knowledge Distillation. Contextual discussions follow (Taori et al., 2023; Luo et al., 2023c), including instruction following and retrieval-augmented generation (RAG) capabilities. Alignment research investigates thinking patterns, persona/preference modeling, and value alignment. The...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "lan- guage generation (NLG), information retrieval, recommen- dation systems, text generation evaluation, and code gen- eration. 
Finally, the survey addresses multi-modality (Liu et al., 2023e; Zhao et al., 2023b), exploring how KD enhances LLMs’ ability to interpret and integrate multiple forms of input, enriching their utility and applicability across various contexts. Verticalization Distillation. This section assesses the ap- plication of KD across diverse vertical domains, offering insights...\n", - "\n", - "PROCESSED TEXT:\n", - "tion, and Code Generation**\n", - "\n", - "Finally, the survey explores how Knowledge Distillation (KD) enhances Large Language Models (LLMs) in interpreting and integrating multiple forms of input, enriching their utility and applicability across various contexts. Verticalization Distillation\n", - "This section examines the application of KD across diverse domains, providing insights into how distilled LLMs can be tailored for specialized fields such as Law, Medical & Healthcare (Wang et al., 2023a), Finance (Zhan...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "meet the nuanced demands of different industries, thus contributing to the broader AI and ML ecosystem. By navigating through these facets, this survey en- deavors to provide an extensive and nuanced analysis of knowledge distillation in the era of LLMs. It serves as a guide for researchers, practitioners, and enthusiasts in the field, shedding light on current methodologies, challenges, and opportunities for innovation in this rapidly evolving domain. Declaration. This survey represents our ear...\n", - "\n", - "PROCESSED TEXT:\n", - "stem. by navigating through these facets, this survey endeavors to provide an extensive and nuanced analysis of knowledge distillation in the era of LLMs. 
it serves as a guide for researchers, practitioners, and enthusiasts in the field, shedding light on current methodologies, challenges, and opportunities for innovation in this rapidly evolving domain....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "foundational paradigms of knowledge dis- tillation, highlighting key methodologies and their impacts across a range of applications. 2.4 Distillation Pipeline in LLM Era SeedKnowledgeSkill/Domain TeacherLLMKnowledgeElicitationStudentModelDistillationAlgorithmsteer driveGeneratedKnowledgeLearningObjectivetrain Fig. 4: An illustration of a general pipeline to distill knowl- edge from a large language model to a student model. The general distillation pipeline of LLMs is a structured and methodical...\n", - "\n", - "PROCESSED TEXT:\n", - "across a range of applications.\n", - "\n", - "Distillation Pipeline in LLM Era\n", - "---------------------------\n", - "\n", - "The Distillation Pipeline is a structured and methodical process aimed at transferring knowledge from a sophisticated teacher model to a less complex student model. This pipeline is integral for leveraging the advanced capabilities of models like GPT-4 or Gemini in more accessible and efficient open-source counterparts.\n", - "\n", - "Stages of Distillation Pipeline\n", - "-----------------------------\n", - "\n", - "1. 
**Knowledge Eli...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "seen in Figure 2. I. Target Skill or Domain Steering Teacher LLM. The first stage involves directing the teacher LLM towards a specific target skill or domain. This is achieved through care- fully crafted instructions or templates that guide the LLM’s focus. These instructions are designed to elicit responses that demonstrate the LLM’s proficiency in a particular area, be it a specialized domain like healthcare or law, or a skill such as reasoning or language understanding. The objective here is...\n", - "\n", - "PROCESSED TEXT:\n", - "ards a specific target skill or domain This is achieved through carefully crafted instructions or templates that guide the LLM's focus These instructions are designed to elicit responses that demonstrate the LLM's proficiency in a particular area be it a specialized domain like healthcare or law or a skill such as reasoning or language understanding The objective here is to utilize the teacher LLM's extensive training and nuanced capabilities to generate outputs that are rich in the specific kno...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "to generate more elaborate and detailed outputs based on this initial infor- mation. 
The seed knowledge is crucial as it provides a foundation upon which the teacher model can build and expand, thereby creating more comprehensive and in-depth knowledge examples. III. Generation of Distillation Knowledge. In response to the seed knowledge and steering instructions, the teacher LLM generates knowledge examples. These examples are predominantly in the form of question-and-answer (QA) dialogues or n...\n", - "\n", - "PROCESSED TEXT:\n", - "his initial information the teacher model generates knowledge examples predominantly in the form of question-and-answer dialogues or narrative explanations aligning with the natural language processing understanding capabilities of the 7 LLM these examples are typically in the form of explanations or narratives addressing various topics thereby creating more comprehensive and in-depth knowledge examples the generated knowledge examples constitute the core of the distillation knowledge encapsulat...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "Specific Learn- ing Objective. The final stage involves the utilization of the generated knowledge examples to train the student model. This training is guided by a loss function that aligns with the learning objectives. The loss function quantifies the student model’s performance in replicating or adapting the knowledge from the teacher model. By minimizing this loss, the student model learns to emulate the target skills or domain knowledge of the teacher, thereby acquiring similar capabilities...\n", - "\n", - "PROCESSED TEXT:\n", - "knowledge examples to train the student model. This training is guided by a loss function that aligns with the learning objectives. 
The loss function quantifies the student model's performance in replicating or adapting the knowledge from the teacher model. By minimizing this loss, the student model learns to emulate the target skills or domain knowledge of the teacher, thereby acquiring similar capabilities....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "domain to steer the LLM and elicit knowledge, s∼ S denotes an example of the seed knowledge, upon which the LLM can explore to generate novel knowledge, Parse( o, s)stands for to parse the distillation example ( e.g., (x, y)) from the teacher LLM’s output o(plus the input sin some cases), andpTrepresents the teacher LLM with parameters θT. Given the datasets D(kd) Ibuilt for distillation, we then define a learning objective as L=X ILI(D(kd) I;θS), (2) whereP Idenotes there could be multiple task...\n", - "\n", - "PROCESSED TEXT:\n", - "which the LLM can explore to generate novel knowledge, Parse( o, s)stands for to parse the distillation example ( e.g., (x, y)) from the teacher LLM’s output o(plus the input sin some cases), andpTrepresents the teacher LLM with parameters θT. 
Given the datasets D(kd) Ibuilt for distillation, we then define a learning objective as L=X ILI(D(kd) I;θS), (2) where P Idenotes there could be multiple tasks or skills being distilled into one student model, LI(·;·)stands for a specific learning objecti...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "it is categorized into two principal steps: ‘Knowledge,’ focusing on eliciting knowledge from teacher LLMs (Eq.1), and ‘Distillation,’ centered on injecting this knowledge into student models (Eq.2). We will elaborate on these two processes in the subsequent sections. 3.1 Knowledge This section focuses on the approaches to elicit knowledge from teacher LLMs. According to the manners to acquire knowledge, we divided them into Labeling ,Expansion ,DataCuration ,Feature ,Feedback , and Self-Knowled...\n", - "\n", - "PROCESSED TEXT:\n", - "...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "dataset and feeding it into LLMs to obtain the desired generations. Moreover, the generation of yis controllable through the predefined Iandc. This process can be formulated as follows: D(lab)={x, y|x∼ X, y∼pT(y|I⊕c⊕x)}. (3) Input xcould be sourced from existing NLP task datasets, which serve as typical reservoirs for distillation efforts. Numerous works have sought to harness the capa- bilities of powerful LLMs as teachers for annotating dataset samples across a range of tasks. 
For instance, ef...\n", - "\n", - "PROCESSED TEXT:\n", - "is process can be formulated as follows: D(lab)={x, y|x∼ X, y∼pT(y|I⊕c⊕x)}....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "al., 2023; Li et al., 2022; Ho et al., 2023; Magister et al., 2023; Fu et al., 2023; Ramnath et al., 2023; Li et al., 2023d; Liu et al., 2023g), among others. Rather than concentrating on specific tasks, many current works focus on labeling outputs based on instructions, thereby teaching student models to solve tasks in a more flexible way by following in- structions. Collections of various NLP tasks, complemented by instructional templates, serve as valuable input sources forx. For instance, FL...\n", - "\n", - "PROCESSED TEXT:\n", - "works concentrate on labeling outputs based on instructions, teaching student models to solve tasks in a more flexible way by following instructions. Collections of various NLP tasks, complemented by instructional templates, serve as valuable input sources for training models. For instance, FLAN-v2 collections provide extensive publicly available sets of tasks with labeled responses from teacher LLMs, built from predefined templates that lack diversity and may have gaps between human queries....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "powerful LLMs, like ShareGPT. Additionally, Xu et al. (2023b) and Anand et al. 
(2023) label the real questions sampled from forums like Quora and Stack Overflow. Moreover, the process of labeling could be guided by instructions Ior demonstrations c. A commonly used in- struction type for guiding labeling is chain-of-thought (CoT) prompt (Hsieh et al., 2023; Fu et al., 2023; Magister et al., 2023). Mukherjee et al. (2023) add multiple system messages (e.g. “You must generate a detailed and long a...\n", - "\n", - "PROCESSED TEXT:\n", - "023b) and Anand et al. (2023) label the real questions sampled from forums like Quora and Stack Overflow. Moreover, the process of labeling could be guided by instructions or demonstrations. A commonly used instruction type for guiding labeling is the chain-of-thought (CoT) prompt. Mukherjee et al. (2023) add multiple system messages (e.g. “You must generate a detailed and long answer.” or “explain like I’m five, think step-by-step”) to elicit rich signals. Yue et al. (2023a) and Chenglin et al....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "Generate≻≻𝑦\" 𝑦! 𝑦# 𝑥 𝑥& CorrectExpand𝑐 Fig. 5: An illustration of different knowledge elicitation methods from teacher LLMs. 
Labeling : The teacher generates the output from the input; Expansion : The teacher generates samples similar to the given demonstrations through in- context learning; Data Curation : The teacher synthesizes data according to meta-information, such as a topic or an entity; Feature : Feed the data into the teacher and extract its internal knowledge, such as logits and featu...\n", - "\n", - "PROCESSED TEXT:\n", - "utput from input; Teacher generates samples similar to given demonstrations through in-context learning; Data is curated according to meta-information such as topic or entity; Data is fed into the teacher to extract knowledge such as logits and features; Teacher provides feedback on student's output such as preferences, corrections, and expansions of challenging samples; Student generates outputs which is then filtered for high-quality or evaluated by student itself\"...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "Overflow. 3.1.2 Expansion While the labeling approach is simple and effective, it faces certain limitations. Primarily, it is constrained by the scale and variety of the input data. In real-world applications, especially those involving user conversations, there are also concerns regarding the privacy of the data involved. To address these limitations, various expansion methods have been proposed (Wang et al., 2022a; Taori et al., 2023; Chaud- hary, 2023; Si et al., 2023; Ji et al., 2023a; Luo e...\n", - "\n", - "PROCESSED TEXT:\n", - "s constrained by the scale and variety of the input data. In real-world applications, especially those involving user conversations, there are concerns regarding the privacy of the data involved. 
Various expansion methods have been proposed to address these limitations. These methods take the demonstrations as seed knowledge and aim to expand a large scale and diverse data by in-context learning....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "the existing dataset, in the expansion approach, both x andyare generated by teacher LLMs. This process can be formulated as follows: D(exp)={(x, y)|x∼pT(x|I⊕c), y∼pT(y|I⊕x)}.(4) In this formulation, xand yrepresent the new input- output pairs generated by the teacher LLM. The input x is generated based on a set of input-output demonstrations c. The output yis then generated in response to the new input xunder the guidance of an instruction I. Note thatthe demonstrations could be predefined or d...\n", - "\n", - "PROCESSED TEXT:\n", - "...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "subsequent expansion iterations. Subsequently, Taori et al. (2023) applies this ex- pansion method to a more powerful teacher LLM, text- davinci-003, to distill 52K high-quality data. To improve the diversity and coverage during expansion, Wu et al. (2023c) and (Sun et al., 2024b) prompt the teacher LLM to generate instructions corresponding to some specific topics. Xu et al. (2023a) propose an Evol-Instruct method to ex- pand the instructions from two dimensions: difficulty (e.g. 
rewriting the ...\n", - "\n", - "PROCESSED TEXT:\n", - "to a more powerful teacher LLM, text- davinci-003, to distill 52K high-quality data. To improve the diversity and coverage during expansion, Wu et al. (2023c) and (Sun et al., 2024b) prompt the teacher LLM to generate instructions corresponding to some specific topics. Xu et al. (2023a) propose an Evol-Instruct method to expand the instructions from two dimensions: difficulty (e.g. rewriting the question to be more complex) and diversity (e.g. generating more long-tailed instructions). This Evol...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "multi- ple conceptually similar, but semantically varied, samples to improve classification performance. Similarly, TDG (He et al., 2023b) proposes the Targeted Data Generation (TDG) framework, which automatically identifies challenging sub- groups within data and generates new samples for these subgroups using LLMs through in-context learning. 
In summary, the expansion method leverages the in- 9 context learning strengths of LLMs to produce more var- ied and extensive datasets with both inputs ...\n", - "\n", - "PROCESSED TEXT:\n", - "TDG framework leverages LLMs' strengths in in-context learning to generate varied and extensive datasets, but quality and diversity rely heavily on teacher LLMs and initial seed demonstrations, leading to bias and homogeneity issues...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "data. 3.1.3 Data Curation The pursuit of high-quality and scalable data generation in knowledge distillation from LLMs has led to the emergence of the Data Curation approach. This method arises in re- sponse to the limitations observed in both the Labeling and Expansion approaches. These methods often yield data of variable quality and face constraints in quantity. In Labeling, the seed knowledge is sourced from task datasets, leading to potential noise and dirty data. Meanwhile, in Expansion, t...\n", - "\n", - "PROCESSED TEXT:\n", - "ed to the emergence of the Data Curation approach. This method arises in response to the limitations observed in both the Labeling and Expansion approaches. These methods often yield data of variable quality and face constraints in quantity.\n", - "\n", - "In Labeling, the seed knowledge is sourced from task datasets, leading to potential noise and dirty data. Meanwhile, in Expansion, the input data is derived from seed demonstrations, which can result in homogeneous data when generated in large quantities. 
T...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "approach to synthesize data from scratch. Numerous diverse meta- information, such as topics or knowledge points, could be incorporated into this process to generate controllable x andy. Thus, this process can be meticulously controlled to yield datasets that are not only large in scale but also of high quality. The formulation for Data Curation can be represented as: D(cur)={(x, y)|x∼pT(x|I⊕m), y∼pT(y|I⊕x)}.(5) In this formulation, mrepresents the diverse meta- information used to guide the syn...\n", - "\n", - "PROCESSED TEXT:\n", - "edge points, could be incorporated into this process to generate controllable output. Thus, this process can be meticulously controlled to yield datasets that are not only large in scale but also of high quality. The formulation for Data Curation can be represented as: D(cur)={(x, y)|x∼pT(x|I⊕m), y∼pT(y|I⊕x)}. In this formulation, mrepresents the diverse meta-information used to guide the synthesis of x, and Iis the instruction guiding teacher LLMs to generate xory. Different studies primarily v...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "the World , they explore 30 meta-topics like ”Technology” and ”Food and Drink.” the teacher LLMs then use this meta-information to distill a broad array of instructions and conversations, achieving a substantial scale of 1.5 million instances. 
UltraChat stands out with its lexical and topical diversity. The UltraLLaMA model, fine- tuned on this data, consistently surpasses other open-source models. Another notable series, phi(Gunasekar et al., 2023; Li et al., 2023a; Mar, 2023), focuses on disti...\n", - "\n", - "PROCESSED TEXT:\n", - "ion to distill a broad array of instructions and conversations, resulting in a substantial scale of 1.5 million instances. UltraChat stands out with its lexical and topical diversity, fine-tuned on this data to consistently surpass other open-source models....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "tokens of Python exercises with solutions. Remarkably, thephi-1 model, despite its smaller size, outperforms nearly all open-source models on coding benchmarks like Hu- manEval and MBPP while being 10 times smaller in model size and 100 times smaller in dataset size. MFTCoder (Liu et al., 2023d) utilizes hundreds of Python knowledge points as meta-information to create a CodeExercise Dataset. In contrast, Magicoder (Wei et al., 2023) and WaveCoder (Yu et al., 2024) get raw code collections from ...\n", - "\n", - "PROCESSED TEXT:\n", - "el outperforms nearly all open-source models on coding benchmarks like HumanEval and MBPP while being 10 times smaller in model size and 100 times smaller in dataset size. MFTCoder (Liu et al., 2023) utilizes hundreds of Python knowledge points as meta-information to create a CodeExercise Dataset. In contrast, Magicoder (Wei et al., 2023) and WaveCoder (Yu et al., 2024) generate instructional data from open-source code collections using this as meta-information for data augmentation. 
In the cont...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "et al., 2022; Meng et al., 2023). In conclusion, Data Curation through teacher LLMs has emerged as a promising technique for synthesizing datasets that are not only high-quality and diverse but also large in scale. The success of models like phi-1 in specialized domains underscores the efficacy of this method. The ability to create synthetic datasets will become a crucial technical skill and a key area of focus in AI (Li et al., 2023a). 3.1.4 Feature The previously discussed knowledge elicitatio...\n", - "\n", - "PROCESSED TEXT:\n", - "a promising technique for synthesizing datasets that are not only high-quality and diverse but also large in scale. The success of models like phi-1 in specialized domains underscores the efficacy of this method. The ability to create synthetic datasets will become a crucial technical skill and a key area of focus in AI....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "with fewer than 1 billion parameters (cf. Gou et al. (2021) for detail). However, recent research has begun to explore white-box distillation in the context of generative LLMs (Timiryasov and Tastet, 2023; Liang et al., 2023a; Gu et al., 2024; Agarwal et al., 2024; Liu et al., 2023a; Wen et al., 2023; Wan et al., 2024a; Zhao and Zhu, 2023; Qin et al., 2023b; Boizard et al., 2024; Zhong et al., 2024). 
The typical method for acquiring this feature knowledge involves teacher LLMs annotating the out...\n", - "\n", - "PROCESSED TEXT:\n", - "to explore white-box distillation in the context of generative LLMs (Timiryasov and Tastet, 2023; Liang et al., 2023a; Gu et al., 2024; Agarwal et al., 2024; Liu et al., 2023a; Wen et al., 2023; Wan et al., 2024a; Zhao and Zhu, 2023; Qin et al., 2023b; Boizard et al., 2024; Zhong et al., 2024). typically involves teacher LLMs annotating the output sequence y with its internal representations. these annotations are then distilled into the student model using methods such as kullback-leibler diver...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "(such as output distri- bution) from the teacher LLM. 10 The most straightforward method to elicit feature knowl- edge of teacher is to label a fixed dataset of sequences with token-level probability distributions (Sanh et al., 2019; Wen et al., 2023). To leverage the rich semantic and syntactic knowledge in intermediate layers of the teacher model, TED (Liang et al., 2023a) designs task-aware layer-wise distillation. They align the student’s hidden representations with those of the teacher at e...\n", - "\n", - "PROCESSED TEXT:\n", - "d to elicit feature knowledge of teacher is to label a fixed dataset of sequences with token-level probability distributions. TED (Liang et al., 2023a) designs task-aware layer-wise distillation. They align the student's hidden representations with those of the teacher at each layer, selectively extracting knowledge pertinent to the target task. Gu et al. (2024) and Agarwal et al. 
(2024) introduce a novel approach where the student model generates sequences, termed'self-generated sequences'. The...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "distilling feature knowledge from teacher LLMs have been proposed (Tao et al., 2022a; Liu et al., 2023a; Kim et al., 2023b). These methods aim to preserve the original output distribution when quantizing the LLMs, ensuring minimal loss of performance. Additionally, feature knowledge could serve as a potent source for multi-teacher knowledge distil- lation. Timiryasov and Tastet (2023) leverages an ensemble of GPT-2 and LLaMA as teacher models to extract output distributions. Similarly, FuseLLM (...\n", - "\n", - "PROCESSED TEXT:\n", - "n when quantizing LLMs, ensuring minimal loss of performance. Additionally, feature knowledge could serve as a potent source for multi-teacher knowledge distillation....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "knowledge from teacher LLMs, such as output distributions and intermediate layer features, white- box approaches enable a more nuanced transfer of informa- tion. While showing promise, especially in smaller models, its application is not suitable for black-box LLMs where internal parameters are inaccessible. Furthermore, student models distilled from white-box LLMs may underperform compared to their black-box counterparts, as the black-box teacher LLMs (e.g. 
GPT-4) tend to be more powerful. 3.1....\n", - "\n", - "PROCESSED TEXT:\n", - "ox approaches enable a more nuanced transfer of information. While showing promise, especially in smaller models, its application is not suitable for black-box LLMs where internal parameters are inaccessible. Furthermore, student models distilled from white-box LLMs may underperform compared to their black-box counterparts, as black-box teacher LLMs tend to be more powerful. 3.1.5 Feedback Most previous works focus on one-way knowledge transfer from the teacher to the student for imitation, with...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "through Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022a). Here is a generalized formulation for eliciting feedback knowledge: D(fb)={(x, y, ϕ fb(x, y;θT))|x∼ X, y∼pS(y|x)}, (7) where ydenotes the output generated by the student model in response to x, and ϕfb(·;θT))represents providing feedback from teacher LLMs. This operation evaluates thestudent’s output ygiven the input x, by offering assess- ment, corrective information, or other forms of guidance. This feedback knowledge...\n", - "\n", - "PROCESSED TEXT:\n", - "2022a). This generalized formulation for eliciting feedback knowledge involves the following steps: \n", - "\n", - "1. D(fb)={(x, y, ϕ fb(x, y;θT))|x∼ X, y∼pS(y|x)}, where ydenotes the output generated by the student model in response to x, and ϕfb(·;θT))represents providing feedback from teacher LLMs. This operation evaluates the student’s output ygiven the input x, by offering assessment, corrective information, or other forms of guidance. 
This feedback knowledge enables the student to refine its responses ...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "2023; Lee et al., 2023a). Preference, as previously discussed, represents a notable form of feedback knowledge from teacher models. Various knowledge of preferences could be distilled from teachers by prompting it with specific criteria. Bai et al. (2022a) in- troduce RLAIF for distilling harmlessness preferences from LLMs. This involves using an SFT-trained LLM to generate response pairs for each prompt, then ranking them for harmlessness to create a preference dataset. This dataset is distille...\n", - "\n", - "PROCESSED TEXT:\n", - "g proposed to distill them from teacher models. One notable approach is the use of RLAIF, which involves generating response pairs for each prompt and ranking them for harmlessness to create a preference dataset. This dataset is then used to train a more harmless LLM policy, such as Wizard- Math (Luo et al., 2023b), which focuses on mathematical reasoning. To further improve the quality of distilled preference data, researchers have developed the UltraFeedback dataset, a large-scale collection o...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "various instructions and models to produce comparative data. 
Then, GPT-4 is used to score candidates from various aspects of preference, including instruction-following, truthfulness, honesty and helpfulness. Beyond merely assessing student generations, teachers can also furnish extensive feedback on instances where students underperform. In Lion (Jiang et al., 2023b), teacher model pinpoints instructions that pose challenges to the student model, generating new, more difficult instructions aime...\n", - "\n", - "PROCESSED TEXT:\n", - "s from various aspects of preference, including instruction-following, truthfulness, honesty and helpfulness. Beyond merely assessing student generations, teachers can also furnish extensive feedback on instances where students underperform. In Lion (Jiang et al., 2023b), teacher model pinpoints instructions that pose challenges to the student model, generating new, more difficult instructions aimed at bolstering the student’s abilities. PERsD (Chen et al., 2023a) showcases a method where teache...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "teacher model’s distribution over the student’s generations can itself act as a form of feedback. MiniLLM (Gu et al., 2024) and GKD (Agarwal et al., 2024) present an innovative strategy wherein the student model initially generates sequences, followed by teacher model producing an output distribution as feedback. This method leverages the teacher’s insight to directly inform and refine the student model’s learning process. 
3.1.6 Self-Knowledge The knowledge could also be elicited from the studen...\n", - "\n", - "PROCESSED TEXT:\n", - "iniLLM and GKD present an innovative strategy wherein the student model generates sequences, followed by the teacher model producing an output distribution as feedback. This method leverages the teacher’s insight to directly inform and refine the student model’s learning process. 3.1.6 Self-Knowledge The knowledge can be elicited from the student itself, which we refer to as Self-Knowledge. In this setting, the same model acts both as the teacher and the student, iteratively improving itself by ...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "self-knowledge could be formulated as: D(sk)={(x, y, ϕ sk(x, y))|x∼ S, y∼pS(y|I⊕x)},(8) where ϕsk(·)is a generalized function that represents an additional process to the self-generated outputs y, which could include but is not limited to filtering, rewarding, or any other mechanisms for enhancing or evaluating y. It could be governed by external tools or the student itself θS. 
Recent research in this area has proposed various innovative methodologies to elicit self-knowledge, demonstrating its ...\n", - "\n", - "PROCESSED TEXT:\n", - "at represents an additional process to the self-generated outputs y, which could include but is not limited to filtering, rewarding, or any other mechanisms for enhancing or evaluating y....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "which utilizes GPT-3 for data augmentation through the Expansion approach, gen- erating additional data samples to enhance the dataset. This enriched dataset subsequently fine-tunes the original model. Other methods aim to elicit targeted knowledge from student models by modifying prompts, and leveraging these data for further refinement. In Self-Align (Sun et al., 2024b), they find that models fine-tuned by Self-Instruct data tend to generate short or indirect responses. They prompt this model ...\n", - "\n", - "PROCESSED TEXT:\n", - "es to enhance the model's capabilities. This process fine-tunes the original model, allowing it to produce more accurate and detailed responses. Other methods aim to elicit targeted knowledge from student models by modifying prompts and leveraging the data for further refinement....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "reinforcement learning. Several other approaches employ filtering methods to refine self-generated data. 
For exam- ple, Impossible Distillation (Jung et al., 2023) targets sen- tence summarization tasks, implementing filters based on entailment, length, and diversity to screen self-generated summaries. LMSI (Huang et al., 2023a) generates multiple CoT reasoning paths and answers for each question, and then retains only those paths that lead to the most consistent answer. Note that refined self-k...\n", - "\n", - "PROCESSED TEXT:\n", - "f-generated data. For instance, Impossible Distillation targets sentence summarization tasks, implementing filters based on entailment, length, and diversity to screen self-generated summaries. LMSI generates multiple CoT reasoning paths and answers for each question, and retains only those paths that lead to the most consistent answer. This process enables refined self-knowledge to be iteratively acquired as the student model improves further enhancing its capabilities....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "and filtered using a scoring function. Subsequently, the lan- guage model undergoes fine-tuning on this curated dataset,employing an offline RL objective. Self-Play (Chen et al., 2024a) introduces a framework resembling iterative DPO, where the language model is fine-tuned to differentiate the self-generated responses from the human-annotated data. These self-generated responses could be seen as “negative knowledge” to promote the student to better align with the target distribution. Self-Reward...\n", - "\n", - "PROCESSED TEXT:\n", - "this curated dataset, employing an offline RL objective. 
Self-Play (Chen et al., 2024a) introduces a framework resembling iterative DPO, where the language model is fine-tuned to differentiate the self-generated responses from the human-annotated data. These self-generated responses could be seen as “negative knowledge” to promote the student to better align with the target distribution. Self-Rewarding (Yuan et al., 2024a) explores a novel and promising approach by utilizing the language model i...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "range of distillation tech- niques, from the strategies that enhance imitation by Su- pervised Fine-Tuning ,Divergence and Similarity , to advanced methods like Reinforcement Learning and Rank Optimization , as shown in Figure 3. 3.2.1 Supervised Fine-Tuning Supervised Fine-Tuning (SFT), or called Sequence-Level KD (SeqKD) (Kim and Rush, 2016), is the simplest and one of the most effective methods for distilling powerful black-box LLMs. SFT finetunes student model by maximizing the like- lihood ...\n", - "\n", - "PROCESSED TEXT:\n", - "ivergence and similarity, to advanced methods like reinforcement learning and rank optimization, as shown in Figure 3.2.1. 
Supervised fine-tuning, or sequence-level knowledge distillation (SeqKD), is a simple yet effective method for distilling powerful black-box LLMs....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "LLMs (Taori et al., 2023; Chiang et al., 2023; Wu et al., 2023c; Xu et al., 2023a; Luo et al., 2023b). Additionally, SFT has been ex- plored in many self-distillation works (Wang et al., 2022a; Huang et al., 2023c; Xu et al., 2023b; Zelikman et al., 2022). Due to the large number of KD works applying SFT, we only list representative ones here. More detailed works can be found in §4. 3.2.2 Divergence and Similarity This section mainly concentrates on algorithms designed for distilling feature kno...\n", - "\n", - "PROCESSED TEXT:\n", - "works....\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "log2p(t) p(t)+q(t)+Pq(t) log2q(t) p(t)+q(t)\u0011 TABLE 1: Functional forms of Dfor various divergence types. p: reference Similarity Function LF Expression L2-Norm Distance ∥ΦT(fT(x, y))−ΦS(fS(x, y))∥2 L1-Norm Distance ∥ΦT(fT(x, y))−ΦS(fS(x, y))∥1 Cross-Entropy Loss −PΦT(fT(x, y)) log(Φ S(fS(x, y))) Maximum Mean Discrepancy MMD (ΦT(fT(x, y)),ΦS(fS(x, y))) TABLE 2: Summary of similarity functions in knowledge distillation. 
and student models, represented by a general divergence function D: LDiv= E x∼...\n", - "\n", - "PROCESSED TEXT:\n", - "types\n", - "p: reference Similarity Function L2-Norm Distance ∥ΦT(fT(x, y))−ΦS(fS(x, y))∥2 L1-Norm Distance ∥ΦT(fT(x, y))−ΦS(fS(x, y))∥1 Cross-Entropy Loss −PΦT(fT(x, y)) log(Φ S(fS(x, y))) Maximum Mean Discrepancy MMD (ΦT(fT(x, y)),ΦS(fS(x, y)))...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "modes of pT. However, when a student model is unable to learn all modes of a highly complex teacher, the re- sultant “mode-covering” behavior might cause the student to assign probability mass to tokens with low probability under the teacher’s distribution (cf. Figure 6 blue curve). This mode-covering phenomenon can potentially lead to hallucinations and low-quality generations. Alternatively, mode-seeking divergences like reverse KL prioritize tokens where the teacher assigns high probabilities...\n", - "\n", - "PROCESSED TEXT:\n", - "probability mass to tokens with low probability under the teacher's distribution. This can result in hallucinations and low-quality generations. \n", - "\n", - "mode-seeking divergences, such as reverse KL, prioritize tokens with high probabilities, mitigating the risk of low-quality outputs. However, they often come at the cost of reduced diversity. Gu et al. (2024) use policy gradient methods to optimize for this approach, while Agarwal et al. 
(2024) and Sason and Verd´u (2016) assess the efficacy of differ...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "distillation, finding the optimal divergence to be task-dependent. For instance, forward KL divergence is more suitable for tasks like Machine Translation, where the output has fewer modes or variations, while reverse KL divergence is preferable for tasks like dialogue generation and instruction tuning, which involve multiple modes and a wider range of potential responses. Thus, the nature of the task significantly influences the selection of the divergence function for optimal performance. Simi...\n", - "\n", - "PROCESSED TEXT:\n", - "nce is more suitable for tasks like machine translation, where the output has fewer modes or variations, while reverse KL divergence is preferable for tasks like dialogue generation and instruction tuning, which involve multiple modes and a wider range of potential responses. Thus, the nature of the task significantly influences the selection of the divergence function for optimal performance. Similarity-based methods in knowledge distillation aim to align the hidden states or features of the st...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "“mode-seeking” behavior. model with those of the teacher. 
These methods use various similarity metrics to measure and optimize the congruence of internal representations between the two models. The objective is to ensure that the student model not only produces similar outputs to the teacher but also processes information in a comparable manner. The formulation for a similarity-based objective might look like this: LSim= E x∼X,y∼Y[LF(ΦT(fT(x, y)),ΦS(fS(x, y)))],(11) where fT(x, y)andfS(x, y)are ...\n", - "\n", - "PROCESSED TEXT:\n", - "ntations...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "task-aware filters. These filters are designed to selectively capture the most pertinent informa- tion for a specific task from the teacher model. The key objective is to minimize the discrepancy between the filtered representations in both teacher and student models. While similarity-based approaches are common in encoder-based LMs (Sun et al., 2019, 2020; Jiao et al., 2020; Hou et al., 2020; Zuo et al., 2022; Liang et al., 2021), their application in LLM knowledge distillation is not as widesp...\n", - "\n", - "PROCESSED TEXT:\n", - "on for a specific task from the teacher model. The key objective is to minimize the discrepancy between the filtered representations in both teacher and student models. While similarity-based approaches are common in encoder-based LMs, their application in LLM knowledge distillation is not as widespread. 
However, considering their effectiveness, we anticipate an increase in research exploring these methods for LLM distillation in the near future.\n", - "\n", - "3.2.3 Reinforcement Learning This section explor...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "2024b; Ma et al., 2023a; Pang et al., 2023; Du et al., 2023a). The RL-based distillation process typically involves two main stages: 13 Distilled Reward Model Training. The first stage involves training a reward model rϕusing the feedback data D(fd) generated by teacher LLMs. Preference data, as one of the typical feedback, is employed to train the student reward model (Bai et al., 2022a; Cui et al., 2023a; Lee et al., 2023a; Kim et al., 2023a). They usually consist of input-output pairs (x, yw,...\n", - "\n", - "PROCESSED TEXT:\n", - "3 Distilled Reward Model Training. First stage involves training a reward model ϕ using feedback data D(fd) generated by teacher LLMs. Preference data, one of typical feedback, is used to train the student reward model. This typically consists of input-output pairs (x, yw, yl). Here, ywandyl represent \"winning\" and \"losing\" outputs relative to the teacher's preferences. Loss function for the reward model is defined as: LRM(rϕ,D(fd)) = - E (x,yw,yl) ∼D(fd)[logσ(rϕ(x, yw) - rϕ(x, yl))]...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "model. 
It is trained on an erroneous solution rewriting data distilled from a teacher LLM. This distilled reward model can pro- duce token-level rewards for RL training. Reinforcement Learning Optimization. In the second stage, the student model, represented by a policy πθ, is optimized to maximize the expected reward as per the trained reward model. Simultaneously, it minimizes the divergence from a reference policy πref, typically the initial policy of the student model trained by SFT, control...\n", - "\n", - "PROCESSED TEXT:\n", - "reward model can pro- duce token-level rewards for RL training. Reinforcement Learning Optimization. In the second stage, the student model, represented by a policy πθ, is optimized to maximize the expected reward as per the trained reward model. Simultaneously, it minimizes the divergence from a reference policy πref, typically the initial policy of the student model trained by SFT, controlled by a factor β. The RL objective is given by: max πθE x∼X,y∼πθ(y|x)[rϕ(x, y)]−βDKL[πθ(y|x)∥πref(y|x)] (...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "reward model to directly assign rewards during RL, circumventing the need for training a reward model (Lee et al., 2023a; Kwon et al., 2023). While this approach may exhibit superior performance, it comes at a higher computational cost compared to employing a smaller distilled reward model. 
3.2.4 Ranking Optimization Ranking optimization presents a stable and computationally efficient alternative to RL for injecting preference feedback into language models (Rafailov et al., 2023; Song et al., 20...\n", - "\n", - "PROCESSED TEXT:\n", - "del\n", - "\n", - "While this approach may exhibit superior performance, it comes at a higher computational cost compared to employing a smaller distilled reward model\n", - "\n", - "Ranking optimization presents a stable and computationally efficient alternative to RL for injecting preference feedback into language models\n", - "\n", - "This method, diverging from traditional RL approaches, directly incorporates ranking information into language models from a fixed preference dataset during fine-tuning\n", - "\n", - "Intuitively, it directly updates...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "ranking optimization todistill teacher’s preferences into student models (Tunstall et al., 2023; Hong et al., 2023; Yuan et al., 2024a). Zephyr (Tunstall et al., 2023) utilizes Direct Preference Optimization (DPO) (Rafailov et al., 2023) to distill the preference alignment in teacher LLMs. DPO streamlines the objective of reinforcement learning (as in Eq. 13), which involves reward maximization with a KL-divergence constraint, into a single-stage policy training. Specifically, DPO’s training goa...\n", - "\n", - "PROCESSED TEXT:\n", - "ng et al., 2023; Yuan et al., 2024a). Zephyr (Tunstall et al., 2023) utilizes Direct Preference Optimization (DPO) (Rafailov et al., 2023) to distill the preference alignment in teacher LLMs. DPO streamlines the objective of reinforcement learning (as in Eq. 
13), which involves reward maximization with a KL-divergence constraint, into a single-stage policy training. Specifically, DPO’s training goal is to maximize the following expectation: E (x,yw,yl)∼D fd logπθ(yw|x) πref(yw|x)−βlogπθ(yl|x) πr...\n", - "==========================================================================================\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INPUT TEXT:\n", - "LRRHF =X ri