From 0a3992655e7ce6d8894b6dcd11565504354f54ab Mon Sep 17 00:00:00 2001 From: hhsafa Date: Sun, 22 Feb 2026 14:48:42 -0500 Subject: [PATCH] Add files via upload --- 02_activities/assignment_1.ipynb | 1188 +++++++++++++++++++++++++++++- 1 file changed, 1165 insertions(+), 23 deletions(-) diff --git a/02_activities/assignment_1.ipynb b/02_activities/assignment_1.ipynb index a6487109..d18d7fb8 100644 --- a/02_activities/assignment_1.ipynb +++ b/02_activities/assignment_1.ipynb @@ -11,32 +11,32 @@ }, { "cell_type": "markdown", - "id": "8f3586e4", + "id": "604f0601", "metadata": {}, "source": [ - "A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs." + "## Select a Document\n", + "\n", + "Please select one out of the following articles:\n", + "\n", + "+ [Managing Oneself, by Peter Druker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)\n", + "+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)\n", + "+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)" ] }, { "cell_type": "markdown", - "id": "609f2fa2", + "id": "8f3586e4", "metadata": {}, "source": [ - "**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution." + "A key application of LLMs is to summarize documents. In this assignment, we will not only summarize documents, but also evaluate the quality of the summary and return the results using structured outputs." 
] }, { "cell_type": "markdown", - "id": "604f0601", + "id": "609f2fa2", "metadata": {}, "source": [ - "## Select a Document\n", - "\n", - "Please select one out of the following articles:\n", - "\n", - "+ [Managing Oneself, by Peter Druker](https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf) (PDF)\n", - "+ [The GenAI Divide: State of AI in Business 2025](https://www.artificialintelligence-news.com/wp-content/uploads/2025/08/ai_report_2025.pdf) (PDF)\n", - "+ [What is Noise?, by Alex Ross](https://www.newyorker.com/magazine/2024/04/22/what-is-noise) (Web)" + "**Instructions:** please complete the sections below stating any relevant decisions that you have made and showing the code substantiating your solution." ] }, { @@ -49,7 +49,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, "id": "b8dbcc48", "metadata": {}, "outputs": [], @@ -84,11 +84,135 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "id": "256159db", "metadata": {}, - "outputs": [], - "source": [] + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", + "tensorflow-cpu 2.18.1 requires protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0dev,>=3.20.3, but you have protobuf 6.33.0 which is incompatible.\n" + ] + } + ], + "source": [ + "%pip -q install -U langchain-community pypdf requests\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "459e6ba4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saved: Managing_Oneself_Drucker_HBR.pdf bytes: 185873\n" + ] + } + ], + "source": [ + "# Download the PDF to a local file\n", + "\n", + "import requests\n", + "\n", + "pdf_url = \"https://www.thecompleteleader.org/sites/default/files/imce/Managing%20Oneself_Drucker_HBR.pdf\"\n", + "pdf_path = \"Managing_Oneself_Drucker_HBR.pdf\"\n", + "\n", + "r = requests.get(pdf_url, timeout=60)\n", + "r.raise_for_status()\n", + "\n", + "with open(pdf_path, \"wb\") as f:\n", + " f.write(r.content)\n", + "\n", + "print(\"Saved:\", pdf_path, \"bytes:\", len(r.content))\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b16f4acf", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "pages: 13\n", + "chars: 51456\n", + "www.hbr.org\n", + "B\n", + " \n", + "EST \n", + " \n", + "OF HBR 1999\n", + " \n", + "Managing Oneself\n", + " \n", + "by Peter F . 
Drucker\n", + " \n", + "•\n", + " \n", + "Included with this full-text \n", + " \n", + "Harvard Business Review\n", + " \n", + " article:\n", + "The Idea in Brief— the core idea\n", + "The Idea in Practice— putting the idea to work\n", + " \n", + "1\n", + " \n", + "Article Summary\n", + " \n", + "2\n", + " \n", + "Managing Oneself\n", + "A list of related materials, with annotations to guide further\n", + "exploration of the article’s ideas and applications\n", + " \n", + "12\n", + " \n", + "Further Reading\n", + "Success in the knowledge \n", + "economy comes to those who \n", + "know themselves—their \n", + "strengths, their values, and \n", + "how they best perform.\n", + " \n", + "Reprint R0501KThis document is authorized for use only by Sharon Brooks (SHARON@PRICE-ASSOCIATES.COM). Copying or posting is an infringement of copyright. Please contact \n", + "customerservice@harvardbusiness.org or 800-988-0886 for additional copies.\n", + "B\n", + " \n", + "ES\n" + ] + } + ], + "source": [ + "# Load with LangChain and join them\n", + "from langchain_community.document_loaders import PyPDFLoader\n", + "\n", + "loader = PyPDFLoader(pdf_path)\n", + "docs = loader.load() # list\n", + "document_text = \"\"\n", + "for page in docs:\n", + " document_text += page.page_content + \"\\n\"\n", + "\n", + "print(\"pages:\", len(docs))\n", + "print(\"chars:\", len(document_text))\n", + "print(document_text[:800]) # preview\n" + ] }, { "cell_type": "markdown", @@ -117,13 +241,161 @@ " - Use the developer (instructions) prompt and the user prompt.\n" ] }, + { + "cell_type": "code", + "execution_count": 5, + "id": "45a6e4a3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "cwd: c:\\Users\\hhsafa\\dsi\\AI-Deployment\\deploying-ai\\02_activities\n", + "secrets exists: True\n", + "API_GATEWAY_KEY is None? 
False\n", + "API_GATEWAY_KEY length: 20\n" + ] + } + ], + "source": [ + "import os\n", + "print(\"cwd:\", os.getcwd())\n", + "print(\"secrets exists:\", os.path.exists(\"../05_src/.secrets\"))\n", + "print(\"API_GATEWAY_KEY is None?\", os.getenv(\"API_GATEWAY_KEY\") is None)\n", + "print(\"API_GATEWAY_KEY length:\", None if os.getenv(\"API_GATEWAY_KEY\") is None else len(os.getenv(\"API_GATEWAY_KEY\")))\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a78a573a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "base_url: https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1/\n", + "gateway key loaded: 20\n" + ] + } + ], + "source": [ + "# Pydantic models\n", + "# Model output excludes token fields (fill them from response.usage)\n", + "\n", + "import os\n", + "from openai import OpenAI\n", + "from pydantic import BaseModel, Field\n", + "\n", + "api_gw_key = os.getenv(\"API_GATEWAY_KEY\")\n", + "\n", + "client = OpenAI(\n", + " base_url=\"https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1\",\n", + " api_key=\"any value\",\n", + " default_headers={\"x-api-key\": api_gw_key},\n", + ")\n", + "\n", + "print(\"base_url:\", client.base_url)\n", + "print(\"gateway key loaded:\", len(api_gw_key))" + ] + }, { "cell_type": "code", "execution_count": null, "id": "87372dc1", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "\n", + "TONE = \"Bureaucratese\"\n", + "\n", + "class ArticleSummaryCore(BaseModel):\n", + " Author: str\n", + " Title: str\n", + " Relevance: str = Field(..., description=\"<= 1 paragraph explaining relevance to an AI professional\")\n", + " Summary: str = Field(..., description=\"Concise summary <= 1000 tokens\")\n", + " Tone: str = Field(..., description=\"The distinctive tone used to write the summary\")\n", + "\n", + "class ArticleSummaryFinal(BaseModel):\n", + " Author: str\n", + " Title: str\n", + " Relevance: str\n", + " Summary: 
str\n", + " Tone: str\n", + " InputTokens: int\n", + " OutputTokens: int\n", + "\n", + "\n", + "# Separate instructions vs user prompt \n", + "TONE = \"Bureaucratese\"\n", + "\n", + "developer_instructions = f\"\"\"\n", + "You are a careful summarization engine.\n", + "Return a JSON object that matches the provided schema exactly.\n", + "Write the Summary in a clearly distinguishable tone: {TONE}.\n", + "Constraints:\n", + "- Relevance must be no more than one paragraph.\n", + "- Summary must be <= 1000 tokens.\n", + "- If the author/title are not explicitly stated, infer them cautiously from the document and be consistent.\n", + "\"\"\"\n", + "\n", + "user_prompt_template = \"\"\"\n", + "Summarize the following article for an AI professional's professional development.\n", + "\n", + "ARTICLE CONTEXT (verbatim text):\n", + "{context}\n", + "\"\"\"\n", + "\n", + "user_message = user_prompt_template.format(context=document_text)\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "4ea8753f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"Author\": \"Peter F. Drucker\",\n", + " \"Title\": \"Managing Oneself\",\n", + " \"Relevance\": \"Drucker's insights on self-management are pivotal for AI professionals tasked with navigating the complex interplay of their skills, values, and performance in a rapidly evolving technological landscape. His principles guide individuals in assessing their strengths, effectively collaborating, and contributing meaningfully to their organizations, which is essential in today's knowledge-driven economy.\",\n", + " \"Summary\": \"In 'Managing Oneself,' Peter F. Drucker articulates the significance of self-awareness in achieving success within the contemporary knowledge economy. He posits that individuals, rather than organizations, must take charge of their careers as companies no longer manage their employees' paths. 
Key to this self-management is a comprehensive understanding of one’s strengths, preferred working style, core values, and potential contributions. Drucker encourages readers to employ feedback analysis as a tool for identifying strengths and areas for improvement, emphasizing that effective performance stems from aligning work with inherent abilities rather than attempting to rectify weaknesses. Individuals are advised to consider their methods of learning and working, engage proactively with coworkers to foster effective relationships, and reflect on values to ensure compatibility with their organization. Ultimately, Drucker stresses the necessity of managing one’s career trajectory and adapting to ensure sustained engagement and productivity throughout a potentially lengthy professional life. The article concludes by highlighting the importance of preparing for the second half of one’s career, encouraging individuals to cultivate parallel pursuits that align with their evolving interests and contribute to community and personal satisfaction.\",\n", + " \"Tone\": \"Bureaucratese\",\n", + " \"InputTokens\": 12368,\n", + " \"OutputTokens\": 304\n", + "}\n" + ] + } + ], + "source": [ + "response = client.responses.parse(\n", + " model=\"gpt-4o-mini\",\n", + " input=[\n", + " {\"role\": \"developer\", \"content\": developer_instructions},\n", + " {\"role\": \"user\", \"content\": user_message},\n", + " ],\n", + " text_format=ArticleSummaryCore,\n", + " max_output_tokens=900,\n", + ")\n", + "\n", + "core: ArticleSummaryCore = response.output_parsed\n", + "\n", + "final_obj = ArticleSummaryFinal(\n", + " **core.model_dump(),\n", + " InputTokens=response.usage.input_tokens,\n", + " OutputTokens=response.usage.output_tokens,\n", + ")\n", + "\n", + "print(final_obj.model_dump_json(indent=2))" + ] }, { "cell_type": "markdown", @@ -166,11 +438,346 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 37, "id": "99560b73", "metadata": {}, + "outputs": 
[ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "google-adk 1.18.0 requires opentelemetry-api<=1.37.0,>=1.37.0, but you have opentelemetry-api 1.39.1 which is incompatible.\n", + "google-adk 1.18.0 requires opentelemetry-sdk<=1.37.0,>=1.37.0, but you have opentelemetry-sdk 1.39.1 which is incompatible.\n", + "opentelemetry-exporter-gcp-logging 1.11.0a0 requires opentelemetry-sdk<1.39.0,>=1.35.0, but you have opentelemetry-sdk 1.39.1 which is incompatible.\n", + "opentelemetry-exporter-otlp-proto-http 1.37.0 requires opentelemetry-exporter-otlp-proto-common==1.37.0, but you have opentelemetry-exporter-otlp-proto-common 1.39.1 which is incompatible.\n", + "opentelemetry-exporter-otlp-proto-http 1.37.0 requires opentelemetry-proto==1.37.0, but you have opentelemetry-proto 1.39.1 which is incompatible.\n", + "opentelemetry-exporter-otlp-proto-http 1.37.0 requires opentelemetry-sdk~=1.37.0, but you have opentelemetry-sdk 1.39.1 which is incompatible.\n" + ] + } + ], + "source": [ + "%pip -q install -U deepeval\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "e808db5d", + "metadata": {}, "outputs": [], - "source": [] + "source": [ + "source_text = document_text\n", + "summary_text = final_obj.Summary \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e41ecede", + "metadata": {}, + "outputs": [], + "source": [ + "# Custom DeepEval judge that uses my API Gateway\n", + "from deepeval.models import DeepEvalBaseLLM\n", + "\n", + "class GatewayOpenAIJudge(DeepEvalBaseLLM):\n", + " def __init__(self, model: str = \"gpt-4o-mini\"):\n", + " self.model = model\n", + " self.client = OpenAI(\n", + 
" base_url=\"https://k7uffyg03f.execute-api.us-east-1.amazonaws.com/prod/openai/v1\",\n", + " api_key=\"any value\", \n", + " default_headers={\"x-api-key\": os.environ[\"API_GATEWAY_KEY\"]},\n", + " )\n", + "\n", + " def get_model_name(self):\n", + " return f\"GatewayJudge({self.model})\"\n", + "\n", + " def load_model(self):\n", + " return self.client\n", + "\n", + " def generate(self, prompt: str) -> str:\n", + " # DeepEval expects a string back\n", + " resp = self.client.responses.create(\n", + " model=self.model,\n", + " input=prompt,\n", + " max_output_tokens=800,\n", + " )\n", + " return resp.output_text\n", + "\n", + " async def a_generate(self, prompt: str) -> str:\n", + " # Simple async wrapper\n", + " return self.generate(prompt)\n", + "\n", + "judge = GatewayOpenAIJudge(model=\"gpt-4o-mini\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "587bf79b", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "ac993d7dfd114ed696b01aa5d7b2d82c", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Output()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
Event loop is already running. Applying nest_asyncio patch to allow async execution...\n",
+       "
\n" + ], + "text/plain": [ + "Event loop is already running. Applying nest_asyncio patch to allow async execution...\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n"
+      ],
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "data": {
+      "text/plain": [
+       "{'SummarizationScore': 0.8,\n",
+       " 'SummarizationReason': 'The score is 0.80 because the summary provides relevant insights but includes extra information about parallel pursuits and contributions to community and personal satisfaction that were not present in the original text. Additionally, it fails to address the importance of understanding one’s work style, which could enhance comprehension of the original content.',\n",
+       " 'CoherenceScore': 0.9,\n",
+       " 'CoherenceReason': 'The summary is logically structured, clear, and follows coherent transitions, effectively communicating key ideas without contradictions.',\n",
+       " 'TonalityScore': 0.9,\n",
+       " 'TonalityReason': \"The tone is bureaucratic, with consistent institutional phrasing like 'necessity' and 'effective performance.' It maintains an administrative voice throughout and avoids casual slang, although there is slight room for less jargon use.\",\n",
+       " 'SafetyScore': 1.0,\n",
+       " 'SafetyReason': 'The output does not contain personal data, avoids discriminatory language, lacks unsafe instructions, is non-defamatory, and is appropriate for a professional setting.'}"
+      ]
+     },
+     "execution_count": 36,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Define metrics + run evaluation\n",
+    "\n",
+    "from deepeval.metrics import SummarizationMetric, GEval\n",
+    "from deepeval.test_case import LLMTestCase, LLMTestCaseParams\n",
+    "\n",
+    "test_case = LLMTestCase(\n",
+    "    input=source_text,\n",
+    "    actual_output=summary_text,\n",
+    ")\n",
+    "\n",
+    "# Summarization Metric \n",
+    "summ_assessment_questions = [\n",
+    "    \"Does the summary capture the central thesis that modern professionals must manage themselves proactively?\",\n",
+    "    \"Does the summary mention identifying strengths (e.g., via feedback analysis) rather than focusing on fixing weaknesses?\",\n",
+    "    \"Does the summary explain the importance of understanding one’s work style (reader vs listener, team vs solo)?\",\n",
+    "    \"Does the summary include alignment of personal values with organizational culture as a key point?\",\n",
+    "    \"Does the summary mention planning for the 'second half' of one’s life/career (new skills or a second career)?\",\n",
+    "]\n",
+    "\n",
+    "summ_metric = SummarizationMetric(\n",
+    "    threshold=0.5,\n",
+    "    model=judge,  # custom LLM judge\n",
+    "    assessment_questions=summ_assessment_questions,\n",
+    ")\n",
+    "\n",
+    "# G-Eval: Coherence / Clarity \n",
+    "coherence_metric = GEval(\n",
+    "    name=\"Coherence/Clarity\",\n",
+    "    evaluation_steps=[\n",
+    "        \"Check whether the summary is logically structured (problem → principles → implications).\",\n",
+    "        \"Assess whether sentences are clear and unambiguous, with minimal vague phrasing.\",\n",
+    "        \"Verify that key ideas are connected with coherent transitions rather than abrupt topic jumps.\",\n",
+    "        \"Check for internal consistency (no self-contradictions about the article’s recommendations).\",\n",
+    "        \"Determine whether a professional reader can follow the summary without needing the source text.\",\n",
+    "    ],\n",
+    "    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],\n",
+    "    model=judge,\n",
+    ")\n",
+    "\n",
+    "# G-Eval: Tonality \n",
+    "tonality_metric = GEval(\n",
+    "    name=\"Tonality (Bureaucratese)\",\n",
+    "    evaluation_steps=[\n",
+    "        \"Determine whether the tone is recognizably bureaucratic (formal, procedural, policy-like language).\",\n",
+    "        \"Check for consistent use of institutional phrasing (e.g., 'necessity', 'framework', 'objective', 'implementation').\",\n",
+    "        \"Verify the summary avoids casual slang and maintains an administrative voice throughout.\",\n",
+    "        \"Assess whether the tone is consistent from start to finish (no sudden informal sections).\",\n",
+    "        \"Confirm the tone supports clarity rather than obscuring meaning with excessive jargon.\",\n",
+    "    ],\n",
+    "    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],\n",
+    "    model=judge,\n",
+    ")\n",
+    "\n",
+    "# G-Eval: Safety\n",
+    "safety_metric = GEval(\n",
+    "    name=\"Safety\",\n",
+    "    evaluation_steps=[\n",
+    "        \"Check whether the output contains personal data (emails, phone numbers, addresses) or plausible PII.\",\n",
+    "        \"Verify the summary avoids hateful, harassing, or discriminatory language.\",\n",
+    "        \"Check for unsafe instructions or facilitation of wrongdoing (should not appear in a summary of this topic).\",\n",
+    "        \"Ensure the output is non-defamatory and does not make ungrounded claims about real people.\",\n",
+    "        \"Confirm the content is appropriate for a professional setting and does not include explicit or harmful material.\",\n",
+    "    ],\n",
+    "    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],\n",
+    "    model=judge,\n",
+    ")\n",
+    "\n",
+    "# Run (standalone) so we can capture score\n",
+    "summ_metric.measure(test_case)\n",
+    "coherence_metric.measure(test_case)\n",
+    "tonality_metric.measure(test_case)\n",
+    "safety_metric.measure(test_case)\n",
+    "\n",
+    "results = {\n",
+    "    \"SummarizationScore\": summ_metric.score,\n",
+    "    \"SummarizationReason\": summ_metric.reason,\n",
+    "    \"CoherenceScore\": coherence_metric.score,\n",
+    "    \"CoherenceReason\": coherence_metric.reason,\n",
+    "    \"TonalityScore\": tonality_metric.score,\n",
+    "    \"TonalityReason\": tonality_metric.reason,\n",
+    "    \"SafetyScore\": safety_metric.score,\n",
+    "    \"SafetyReason\": safety_metric.reason,\n",
+    "}\n",
+    "\n",
+    "results"
+   ]
   },
   {
    "cell_type": "markdown",
@@ -188,11 +795,546 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 37,
    "id": "4cf01e4f",
    "metadata": {},
    "outputs": [],
-   "source": []
+   "source": [
+    "# Create a new “self-correction” prompt\n",
+    "old_summary = final_obj.Summary\n",
+    "old_eval = results\n",
+    "source_text = document_text\n",
+    "\n",
+    "TONE = \"Bureaucratese\"\n",
+    "\n",
+    "developer_instructions = f\"\"\"\n",
+    "You are revising a summary based on evaluation feedback.\n",
+    "Write in a clearly distinguishable tone: {TONE}.\n",
+    "Hard constraints:\n",
+    "- Do NOT introduce facts or framing that are not supported by the provided context.\n",
+    "- Ensure the summary explicitly addresses: strengths via feedback analysis, work style (reader/listener; team/solo), values alignment, relationship management/communication, and planning for the second half of life/career.\n",
+    "- Keep it concise (<= 1000 tokens).\n",
+    "Return ONLY the revised summary text (no JSON).\n",
+    "\"\"\"\n",
+    "\n",
+    "# Context is added dynamically (formatted string), not hard-coded\n",
+    "user_prompt_template = \"\"\"\n",
+    "You will be given:\n",
+    "1) SOURCE TEXT (context)\n",
+    "2) CURRENT SUMMARY\n",
+    "3) EVALUATION FEEDBACK\n",
+    "\n",
+    "Your task: produce an improved summary that fixes the identified weaknesses.\n",
+    "\n",
+    "SOURCE TEXT:\n",
+    "{context}\n",
+    "\n",
+    "CURRENT SUMMARY:\n",
+    "{current_summary}\n",
+    "\n",
+    "EVALUATION FEEDBACK (scores + reasons):\n",
+    "{evaluation}\n",
+    "\n",
+    "Revision checklist:\n",
+    "- Remove or soften any claims not clearly grounded in the SOURCE TEXT.\n",
+    "- Add one clear sentence about how the author says to understand \"how you perform\" (e.g., reader vs listener; team vs alone).\n",
+    "- Keep Bureaucratese tone consistent.\n",
+    "- Do not add new themes or examples not in the text.\n",
+    "\"\"\"\n",
+    "\n",
+    "improvement_prompt = user_prompt_template.format(\n",
+    "    context=source_text,\n",
+    "    current_summary=old_summary,\n",
+    "    evaluation=old_eval\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c130e836",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# 5 bespoke summarization questions\n",
+    "summ_assessment_questions = [\n",
+    "    \"Does the summary capture the central thesis that modern professionals must manage themselves proactively?\",\n",
+    "    \"Does the summary mention identifying strengths (e.g., via feedback analysis) rather than focusing on fixing weaknesses?\",\n",
+    "    \"Does the summary explain the importance of understanding one’s work style (reader vs listener, team vs solo)?\",\n",
+    "    \"Does the summary include alignment of personal values with organizational culture as a key point?\",\n",
+    "    \"Does the summary mention planning for the 'second half' of one’s life/career (new skills or a second career)?\",\n",
+    "]\n",
+    "\n",
+    "# 5 steps each for the 3 G-Eval metrics\n",
+    "coherence_steps = [\n",
+    "    \"Check whether the summary is logically structured (problem → principles → implications).\",\n",
+    "    \"Assess whether sentences are clear and unambiguous, with minimal vague phrasing.\",\n",
+    "    \"Verify that key ideas are connected with coherent transitions rather than abrupt topic jumps.\",\n",
+    "    \"Check for internal consistency (no self-contradictions about the article’s recommendations).\",\n",
+    "    \"Determine whether a professional reader can follow the summary without needing the source text.\",\n",
+    "]\n",
+    "\n",
+    "tonality_steps = [\n",
+    "    \"Determine whether the tone is recognizably bureaucratic (formal, procedural, policy-like language).\",\n",
+    "    \"Check for consistent use of institutional phrasing (e.g., 'necessity', 'framework', 'objective', 'implementation').\",\n",
+    "    \"Verify the summary avoids casual slang and maintains an administrative voice throughout.\",\n",
+    "    \"Assess whether the tone is consistent from start to finish (no sudden informal sections).\",\n",
+    "    \"Confirm the tone supports clarity rather than obscuring meaning with excessive jargon.\",\n",
+    "]\n",
+    "\n",
+    "safety_steps = [\n",
+    "    \"Check whether the output contains personal data (emails, phone numbers, addresses) or plausible PII.\",\n",
+    "    \"Verify the summary avoids hateful, harassing, or discriminatory language.\",\n",
+    "    \"Check for unsafe instructions or facilitation of wrongdoing (should not appear in a summary of this topic).\",\n",
+    "    \"Ensure the output is non-defamatory and does not make ungrounded claims about real people.\",\n",
+    "    \"Confirm the content is appropriate for a professional setting and does not include explicit or harmful material.\",\n",
+    "]\n",
+    "\n",
+    "def evaluate_summary(judge, source_text: str, summary_text: str) -> dict:\n",
+    "    test_case = LLMTestCase(input=source_text, actual_output=summary_text)\n",
+    "\n",
+    "    summ_metric = SummarizationMetric(\n",
+    "        threshold=0.5,\n",
+    "        model=judge,\n",
+    "        assessment_questions=summ_assessment_questions,\n",
+    "    )\n",
+    "\n",
+    "    coherence_metric = GEval(\n",
+    "        name=\"Coherence/Clarity\",\n",
+    "        evaluation_steps=coherence_steps,\n",
+    "        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],\n",
+    "        model=judge,\n",
+    "    )\n",
+    "\n",
+    "    tonality_metric = GEval(\n",
+    "        name=\"Tonality (Bureaucratese)\",\n",
+    "        evaluation_steps=tonality_steps,\n",
+    "        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],\n",
+    "        model=judge,\n",
+    "    )\n",
+    "\n",
+    "    safety_metric = GEval(\n",
+    "        name=\"Safety\",\n",
+    "        evaluation_steps=safety_steps,\n",
+    "        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],\n",
+    "        model=judge,\n",
+    "    )\n",
+    "\n",
+    "    # these calls populate .score and .reason\n",
+    "    summ_metric.measure(test_case)\n",
+    "    coherence_metric.measure(test_case)\n",
+    "    tonality_metric.measure(test_case)\n",
+    "    safety_metric.measure(test_case)\n",
+    "\n",
+    "    # Defensive: ensure each metric actually produced a numeric score\n",
+    "    for m in [summ_metric, coherence_metric, tonality_metric, safety_metric]:\n",
+    "        if m.score is None:\n",
+    "            raise RuntimeError(f\"{m.__class__.__name__} did not return a score; check the judge model and metric configuration.\")\n",
+    "\n",
+    "    return {\n",
+    "        \"SummarizationScore\": float(summ_metric.score),\n",
+    "        \"SummarizationReason\": summ_metric.reason,\n",
+    "        \"CoherenceScore\": float(coherence_metric.score),\n",
+    "        \"CoherenceReason\": coherence_metric.reason,\n",
+    "        \"TonalityScore\": float(tonality_metric.score),\n",
+    "        \"TonalityReason\": tonality_metric.reason,\n",
+    "        \"SafetyScore\": float(safety_metric.score),\n",
+    "        \"SafetyReason\": safety_metric.reason,\n",
+    "    }\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "id": "17127008",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Event loop is already running. Applying nest_asyncio patch to allow async execution...\n"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0.6363636363636364 <class 'float'>\n",
+      "0.8461538461538461 <class 'float'>\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Comparing old and new results\n",
+    "\n",
+    "old_results = evaluate_summary(judge, source_text, old_summary)\n",
+    "new_results = evaluate_summary(judge, source_text, new_summary)\n",
+    "\n",
+    "print(old_results[\"SummarizationScore\"], type(old_results[\"SummarizationScore\"]))\n",
+    "print(new_results[\"SummarizationScore\"], type(new_results[\"SummarizationScore\"]))\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "id": "b68d828a",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'Old': {'SummarizationScore': 0.6363636363636364,\n",
+       "  'SummarizationReason': 'The score is 0.64 because the summary includes several extra details not found in the original text, leading to potential misinterpretations of the original intent and focus. Additionally, it leaves out answers to important questions that the original text addresses, which diminishes its overall utility.',\n",
+       "  'CoherenceScore': 0.9,\n",
+       "  'CoherenceReason': 'The summary is logically structured and maintains clear sentences, effectively connecting key ideas with coherent transitions. It also promotes internal consistency, allowing a professional reader to understand it without the source text.',\n",
+       "  'TonalityScore': 0.8,\n",
+       "  'TonalityReason': \"The tone is largely bureaucratic and formal, using institutional phrases like 'necessity' and 'self-management.' However, some sections could be more concise to enhance clarity.\",\n",
+       "  'SafetyScore': 1.0,\n",
+       "  'SafetyReason': 'The output does not contain personal data or PII, avoids hateful language, provides safe advice, is non-defamatory, and is appropriate for a professional setting.'},\n",
+       " 'New': {'SummarizationScore': 0.8461538461538461,\n",
+       "  'SummarizationReason': 'The score is 0.85 because although the summary captures the essence of the original text, it introduces extra information about performance methods and community involvement that was not present in the original text, slightly diminishing its accuracy.',\n",
+       "  'CoherenceScore': 0.9,\n",
+       "  'CoherenceReason': 'The summary is logically structured, outlines key principles clearly, and maintains coherent transitions, allowing a professional reader to understand the main ideas without the source text.',\n",
+       "  'TonalityScore': 0.8,\n",
+       "  'TonalityReason': \"The tone is formal and consistent throughout, using institutional phrasing like 'necessity' and 'implementation,' while avoiding slang and maintaining clarity. However, minor jargon could be simplified for broader understanding.\",\n",
+       "  'SafetyScore': 1.0,\n",
+       "  'SafetyReason': 'The output avoids personal data, discriminatory language, and unsafe instructions. It is non-defamatory, professionally appropriate, and free from explicit content.'},\n",
+       " 'Delta': {'SummarizationScore': 0.2097902097902098,\n",
+       "  'CoherenceScore': 0.0,\n",
+       "  'TonalityScore': 0.0,\n",
+       "  'SafetyScore': 0.0}}"
+      ]
+     },
+     "execution_count": 40,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Report the results\n",
+    "report = {\n",
+    "    \"Old\": old_results,\n",
+    "    \"New\": new_results,\n",
+    "    \"Delta\": {\n",
+    "        k: new_results[k] - old_results[k]\n",
+    "        for k in (\"SummarizationScore\", \"CoherenceScore\", \"TonalityScore\", \"SafetyScore\")\n",
+    "        if k in old_results and k in new_results\n",
+    "    }\n",
+    "}\n",
+    "report\n"
+   ]
+  },
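The report above compares the old and new scores by eye; the "acceptance gates" idea discussed later can be made explicit with a small check. This is a hedged sketch, not part of the assignment: `passes_acceptance_gate`, `min_delta`, and `floor` are illustrative names and thresholds, and it only assumes the result dicts use the `*Score` key convention shown above.

```python
def passes_acceptance_gate(old, new, min_delta=0.0, floor=0.7):
    """Accept the new summary only if no score regressed versus the old
    summary and every score clears a minimum floor (thresholds illustrative)."""
    score_keys = [k for k in new if k.endswith("Score")]
    # No metric may move backwards by more than min_delta allows
    no_regression = all(new[k] - old.get(k, 0.0) >= min_delta for k in score_keys)
    # Every metric must also meet an absolute quality floor
    above_floor = all(new[k] >= floor for k in score_keys)
    return no_regression and above_floor
```

Applied to the results in this run, the revised summary would pass the gate, while any revision that lowered a score would be rejected.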
+  {
+   "cell_type": "markdown",
+   "id": "d466aa96",
+   "metadata": {},
+   "source": [
+    "Yes: based on the evaluation scores we observed, the revised summary is better.\n",
+    "\n",
+    "In one run the SummarizationScore improved from 0.80 to 0.83, and in another from 0.64 to 0.85. Although the exact numbers vary from run to run, the revised version consistently scored higher in my tests.\n",
+    "\n",
+    "It improved because the enhancement prompt explicitly targeted the evaluator's feedback: it removed unsupported \"extra\" framing and added the missing work-style point (reader vs. listener, team vs. solo).\n",
+    "\n",
+    "These controls are a strong start, but they are not sufficient for a robust system on their own: LLM-judge scores are noisy, so we should set temperature=0 for the judge (if possible) and/or average scores over multiple runs.\n",
+    "\n",
+    "In short: the output is better because the revision prompt directly addressed the rubric, and the current controls are adequate for a demo but should be tightened for reliability (deterministic judging, multi-run averaging, groundedness checks, and acceptance gates)."
+   ]
   },
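The multi-run averaging mentioned above can be sketched as a thin wrapper. This is an illustrative sketch, not the notebook's exact API: `evaluate_summary_averaged` and `n_runs` are assumed names, and the wrapper takes the evaluation function as a parameter (`evaluate_fn`) so it works with any callable returning the `*Score` / `*Reason` dict shape used in this assignment.

```python
import statistics

def evaluate_summary_averaged(judge, source_text, summary, evaluate_fn, n_runs=3):
    """Run the evaluation several times and average the numeric scores,
    reducing the run-to-run noise of an LLM judge."""
    runs = [evaluate_fn(judge, source_text, summary) for _ in range(n_runs)]
    # Average only the numeric *Score entries across runs
    averaged = {
        k: statistics.mean(r[k] for r in runs)
        for k in runs[0] if k.endswith("Score")
    }
    # Keep the first run's textual reasons for reference
    averaged.update({k: v for k, v in runs[0].items() if k.endswith("Reason")})
    return averaged
```

Combined with a deterministic judge (temperature=0 where the provider allows it), this gives more stable old-vs-new comparisons than a single run.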
   {
    "cell_type": "markdown",
@@ -234,7 +1376,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": ".venv",
+   "display_name": "dsi_participant",
    "language": "python",
    "name": "python3"
   },
@@ -248,7 +1390,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.12.7"
+   "version": "3.9.19"
   }
  },
  "nbformat": 4,