
Commit

HDembinski committed Jan 28, 2025
1 parent 0be5263 commit 049a429
Showing 1 changed file with 9 additions and 7 deletions.
16 changes: 9 additions & 7 deletions posts/parsing_webpages_with_llm.ipynb
@@ -302,13 +302,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Update: Using distilled DeepSeek R1\n",
"## Update: Using a distilled DeepSeek-R1 model\n",
"\n",
"[DeepSeek R1](https://arxiv.org/abs/2501.12948) is a new Open Source reasoning model which uses test-time compute to improve its reasoning, like OpenAI's o1 model. Basically, it has the chain-of-thought prompting technique hard-wired and will generate a thinking process before answering. In several benchmarks it reaches the same performance as o1.\n",
"[DeepSeek-R1](https://arxiv.org/abs/2501.12948) is a new Open Source reasoning model which uses test-time compute to improve its reasoning, like OpenAI's o1 model. Basically, it has the chain-of-thought prompting technique hard-wired and will generate a thinking process before answering. In several benchmarks it reaches the same performance as o1.\n",
"\n",
"While the full DeepSeek R1 model with over 600b parameters is too large to be run locally, the authors provide distilled small models. We use one of these here based on Llama with 8b parameters. I also tried another version based on Qwen2 with 14b parameters, but it is consistently crashing the Ollama server after a while.\n",
"While the full DeepSeek-R1 model with over 600b parameters is too large to be run locally, the authors provide distilled smaller models. I use one of these based on Llama-3.1 with 8b parameters, the same architecture that `llama3-chatqa` uses. I also tried another version based on Qwen2 with 14b parameters, but it is consistently crashing the Ollama server after a while.\n",
"\n",
"To use it for our task, we need to skip over the thinking process and keep only the final output. We expect the model to be smarter in following our instructions, especially the rule about shortening long lists of authors."
"To use the output of this model for our task, we need to skip over the thinking process and keep only the final output. That is easy to do, because the model is trained to always put its thinking in `<think>` tags.\n",
"\n",
"We expect the model to be smarter in following our requirements, especially the rule about shortening long lists of authors, which requires reasoning."
]
},
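The post-processing step described in the cell above might look roughly like the following sketch. This is a minimal illustration, assuming the `ollama` Python package and the `deepseek-r1:8b` model tag; the exact prompt and model tag used in the notebook are not visible in this diff.

```python
import re

import ollama  # assumes the ollama Python package and a running Ollama server


def ask_distilled_r1(prompt: str, model: str = "deepseek-r1:8b") -> str:
    """Query the distilled model and return only its final answer.

    The model wraps its chain of thought in <think>...</think> tags,
    so we strip that block before returning the text.
    """
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    text = response["message"]["content"]
    # Drop the thinking block; re.DOTALL lets '.' match across newlines.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```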
{
@@ -358,14 +360,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The model is indeed better at counting the authors and appropriately replacing long author lists with the \"et al.\" form. It even managed once to follow the instruction to convert LaTeX code into a proper text description by replacing `\\sqrt{s}` with `√s`, which is quite impressive. Overall, the output is more consistent than with `llama3-chatqa`, but the model still makes a few minor and major mistakes.\n",
"The model is indeed better at counting the authors and appropriately replacing long author lists with the \"et al.\" form. It even managed once to follow the instruction to convert LaTeX code into a proper text description by replacing `\\sqrt{s}` with `√s`, which is impressive. Overall, the output is more consistent than with `llama3-chatqa`, but the model still makes a few minor and major mistakes.\n",
"\n",
"- It occasionally produces an invalid URL, omitting the `https:` prefix. `llama3-chatqa` never makes that mistake.\n",
"- It does not always follow the instruction to remove all emphasis markup from the reference. That can be rectified by post-processing in this case.\n",
"- In one case, it swapped the order of title and journal.\n",
"- Similar to `llama3-chatqa`, it usually hallucinates an invalid journal and DOI for the last paper, because it was only released on arXiv, but it got it occasionally right.\n",
"- Similar to `llama3-chatqa`, it usually hallucinates an invalid journal and DOI for the last paper, because it was only released on arXiv, but occasionally it got it right, too.\n",
"\n",
"In conclusion, the distilled `deepseek-r1` model performs slightly better at this task, although it is using the same architecture as `llama3-chatqa`, at the cost of using 6x more compute. It would be interesting to see whether version that is less quantized performs better, in which case the errors could be attributed to \"noise\" in the reasoning process, or whether the main issue is the limited attention capabilities of 8b models, which have fewer attention blocks compared to larger models."
"In conclusion, the distilled `deepseek-r1` model performs slightly better at this task, although it is using the same architecture as `llama3-chatqa`, at the cost of using 6x more compute. It would be interesting to see whether a version that is less quantized performs better, in which case the errors could be attributed to \"noise\" in the reasoning process, or whether the main issue is the limited attention capability of 8b models, which have fewer attention blocks compared to larger models."
]
},
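The post-processing mentioned in the list above could be as simple as the sketch below. `clean_reference` is a hypothetical helper, not part of the notebook; it only addresses the two failure modes noted in this cell, namely leftover emphasis markup and URLs with a missing `https:` prefix.

```python
import re


def clean_reference(text: str) -> str:
    """Hypothetical cleanup of a single formatted reference string."""
    # Strip Markdown emphasis markers (**...**, *...*, __...__, _..._)
    # while keeping the emphasized text; a naive pattern, sufficient here.
    text = re.sub(r"(\*\*|\*|__|_)(.+?)\1", r"\2", text)
    # Repair protocol-relative URLs such as //doi.org/... or //arxiv.org/...
    text = re.sub(r"(?<!:)//(doi\.org|arxiv\.org)", r"https://\1", text)
    return text
```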
{
