Merge pull request #840 from AishaDarga/hap_score
touma-I authored Dec 7, 2024
2 parents 5fbc6af + 06a2936 commit 2e2e5d4
Showing 2 changed files with 456 additions and 0 deletions.
390 changes: 390 additions & 0 deletions examples/notebooks/hap/generate_hap_score_csv.ipynb
@@ -0,0 +1,390 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"HAP Transform Example Notebook\n",
"=====================================\n",
"\n",
"This notebook processes a CSV file containing text data to analyze for Hate, Abuse, and Profanity (HAP) scores.\n",
"It converts the CSV file into Parquet format, uses the `hap_local_python.py` script to calculate HAP scores, \n",
"and generates outputs for further analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### Overview\n",
"This notebook demonstrates the use of the HAP transformation to annotate documents with a `hap_score`, \n",
"indicating the likelihood of Hate, Abuse, or Profanity in the text.\n",
"\n",
"### Workflow\n",
"The HAP process consists of:\n",
"1. **Sentence Splitting**: Documents are split into sentences using NLTK.\n",
"2. **HAP Annotation**: Each sentence is scored between 0 and 1 (1 = high HAP, 0 = no HAP).\n",
"3. **Aggregation**: The document's final HAP score is the maximum score among all sentences.\n",
"\n",
"\n",
"### Configuration\n",
"- **Model Name**: IBM Granite Guardian (`ibm-granite/granite-guardian-hap-38m` by default).\n",
"- **Document Text Column** (`--doc_text_column`): Specify the input column containing document text to generate the hap_score against. Defaults to `contents`.\n",
"- **Annotation Column** (`--annotation_column`): Specify the output column for HAP scores. Defaults to `hap_score`.\n",
"\n",
"\n",
"### Steps in This Notebook\n",
"1. Define paths and import libraries.\n",
"2. Convert CSV input to Parquet.\n",
"3. Run the HAP transformation script.\n",
"4. View and analyze the results.\n"
]
},
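{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal, self-contained sketch of the split-score-aggregate logic described above, loading the default Granite Guardian checkpoint directly with Hugging Face `transformers`. It is for illustration only: the direct model usage, the softmax readout, and the assumption that label index 1 is the HAP class are not taken from the transform's actual implementation, which is invoked in Step 4."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of the HAP scoring logic; the real work is done by the\n",
"# HAP transform invoked in Step 4. Direct model usage here is an assumption.\n",
"import nltk\n",
"import torch\n",
"from transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
"\n",
"nltk.download(\"punkt\", quiet=True)  # newer NLTK releases may also need \"punkt_tab\"\n",
"\n",
"model_name = \"ibm-granite/granite-guardian-hap-38m\"  # the transform's default model\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"model = AutoModelForSequenceClassification.from_pretrained(model_name)\n",
"\n",
"def doc_hap_score(text: str) -> float:\n",
"    # 1. Sentence splitting with NLTK\n",
"    sentences = nltk.sent_tokenize(text)\n",
"    # 2. Score each sentence between 0 and 1\n",
"    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors=\"pt\")\n",
"    with torch.no_grad():\n",
"        logits = model(**inputs).logits\n",
"    scores = torch.softmax(logits, dim=-1)[:, 1]  # HAP class assumed at label index 1\n",
"    # 3. Aggregation: the document score is the maximum sentence score\n",
"    return float(scores.max())\n",
"\n",
"print(doc_hap_score(\"The service was quick. The agent was very rude.\"))"
]
},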
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
"```\n",
"make venv \n",
"source venv/bin/activate \n",
"pip install jupyterlab\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install 'data-prep-toolkit[ray]==0.2.2'\n",
"! pip install 'data-prep-toolkit-transforms[all]==0.2.2'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import necessary libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import pandas as pd\n",
"import subprocess\n",
"\n",
"from data_processing.runtime.pure_python import PythonTransformLauncher\n",
"from data_processing.utils import ParamsUtils\n",
"from hap_transform_python import HAPPythonTransformConfiguration\n",
"from hap_transform import HAPTransform"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1: Setup runtime parameters\n",
"\n",
"- input-folder: Path to the input data to be used by the transform.\n",
"- output-folder: Path where the output file with HAP scores will be saved.\n",
"- doc_text_column: The column containing the text for analysis (For ex.: `Customer Feedback`).\n",
"- annotation_column: The column where HAP scores will be saved (default: `hap_score`).\n",
"\n",
"**Customization**: \n",
"- Ensure the column containing the text matches the `doc_text_column` parameter.\n",
"- If your text column has a different name, update the value of `--doc_text_column` accordingly.\n",
"- You can adjust other parameters like `--batch_size` and `--max_length` if needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# create parameters\n",
"input_csv = './input-csv'\n",
"input_folder = './input-parquet'\n",
"output_folder = './output'\n",
"output_parquet_file = os.path.join(output_folder, \"customer-feedback.parquet\")\n",
"output_csv_file = os.path.join(output_folder, \"customer-feedback.csv\")\n",
"\n",
"local_conf = {\n",
" \"input_folder\": input_folder,\n",
" \"output_folder\": output_folder,\n",
"}\n",
"\n",
"print(input_folder)\n",
"code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
"params = {\n",
" # Data access. Only required parameters are specified\n",
" \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
" # execution info\n",
" \"runtime_pipeline_id\": \"pipeline_id\",\n",
" \"runtime_job_id\": \"job_id\",\n",
" \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n",
" # hap params\n",
" \"doc_text_column\": \"Customer Feedback\",\n",
"}"
]
},
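{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how this dictionary reaches the launcher, you can print the result of `ParamsUtils.dict_to_req` (the same call Step 4 uses to populate `sys.argv`). This is purely an inspection aid; the exact argument format is whatever the installed `ParamsUtils` produces."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Diagnostic only: show the command-line style arguments generated from params.\n",
"# Step 4 assigns this same list to sys.argv before launching the transform.\n",
"print(ParamsUtils.dict_to_req(d=params))"
]
},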
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Setup Input Data\n",
"\n",
"- Place your CSV file in the `input_csv`.\n",
"- Ensure the column containing the text matches the `doc_text_column` parameter.\n",
"- If your text column has a different name, update the `doc_text_column` parameter in later cells."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Clear the existing input-parquet folder\n",
"if os.path.exists(input_folder):\n",
" for file_name in os.listdir(input_folder):\n",
" file_path = os.path.join(input_folder, file_name)\n",
" try:\n",
" os.remove(file_path)\n",
" print(f\"Deleted file: {file_path}\")\n",
" except Exception as e:\n",
" print(f\"Failed to delete {file_path}: {e}\")\n",
"else:\n",
" os.makedirs(input_csv, exist_ok=True) # Updated line\n",
" os.makedirs(input_folder, exist_ok=True) # Updated line\n",
" print(f\"Created folder: {input_csv}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"csv_files = [f for f in os.listdir(input_csv) if f.endswith(\".csv\")]\n",
"\n",
"if not csv_files:\n",
" print(f\"No CSV files found in the input folder: {input_csv}\")\n",
" print(\"Please place a CSV file in the input folder and rerun this script.\")\n",
"else:\n",
" # Pick the first CSV file in the folder\n",
" csv_file_path = os.path.join(input_csv, csv_files[0])\n",
" print(f\"Using CSV file: {csv_file_path}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: Convert CSV to Parquet\n",
"Convert the selected CSV file to Parquet format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"parquet_file_path = os.path.join(input_folder, \"customer-feedback.parquet\")\n",
"df = pd.read_csv(csv_file_path)\n",
"df.to_parquet(parquet_file_path, index=False)\n",
"print(f\"CSV file converted to Parquet format at: {parquet_file_path}\")"
]
},
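{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, read the Parquet file back and confirm that the column named in `doc_text_column` (here `Customer Feedback`) is present before invoking the transform."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: the transform needs doc_text_column in the input.\n",
"check_df = pd.read_parquet(parquet_file_path)\n",
"print(f\"Rows: {len(check_df)}, Columns: {list(check_df.columns)}\")\n",
"assert \"Customer Feedback\" in check_df.columns, \"doc_text_column not found in input\""
]
},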
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4: Invoke the transform\n",
"Use python runtime to invoke the transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"sys.argv = ParamsUtils.dict_to_req(d=params)\n",
"launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())\n",
"launcher.launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#The specified folder will include the transformed parquet files.\n",
"\n",
"import glob\n",
"glob.glob(\"python/output/*\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 5: Display Output in a Readable Format\n",
"\n",
"This step checks for any existing CSV files in the output folder and removes them before generating new ones. The following actions are performed:\n",
"\n",
"1. **Listing Output Files**: The script lists all files in the output folder.\n",
"2. **Check for Parquet Files**: It identifies `.parquet` files in the output folder.\n",
"3. **Remove Old CSV Files**: If any previous output files (`hap_complete_output.csv` or `hap_filtered_output.csv`) exist, they are deleted.\n",
"4. **Read Parquet File**: The Parquet file is read into a DataFrame.\n",
"5. **Filter Data**: The relevant columns, `doc_text_column` (from the environment variable) and `hap_score_column`, are selected from the DataFrame.\n",
"6. **CSV Output**: Convert the parquet output to CSV \n",
"7. **Display Output**: Display Output in the notebook for a quick reference \n",
"\n",
"Example Output:\n",
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"0\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th></th>\n",
" <th>Customer Feedback</th>\n",
" <th>hap_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Rating: 4 Comments: \"Service was prompt, but ...</td>\n",
" <td>0.000195</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Rating: 5 Comments: \"Great help from Peter! H...</td>\n",
" <td>0.000153</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Rating: 3 Comments: \"The service was quick, b...</td>\n",
" <td>0.000169</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Rating: 5 Comments: \"Excellent service and ad...</td>\n",
" <td>0.000158</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Locate transformed Parquet files in the output folder\n",
"output_parquet_files = [f for f in os.listdir(output_folder) if f.endswith(\".parquet\")]\n",
"\n",
"if output_parquet_files:\n",
" # Clear existing CSV files in the output folder\n",
" for file_name in os.listdir(output_folder):\n",
" if file_name.endswith(\".csv\"):\n",
" file_path = os.path.join(output_folder, file_name)\n",
" try:\n",
" os.remove(file_path)\n",
" print(f\"Deleted old CSV file: {file_path}\")\n",
" except Exception as e:\n",
" print(f\"Failed to delete {file_path}: {e}\")\n",
"\n",
" for output_parquet in output_parquet_files:\n",
" original_parquet_path = os.path.join(output_folder, output_parquet)\n",
" try:\n",
" # Rename the Parquet file\n",
" os.rename(original_parquet_path, output_parquet_file)\n",
" print(f\"Renamed Parquet file to: {output_parquet_file}\")\n",
"\n",
" # Convert the renamed Parquet file to CSV\n",
" transformed_df = pd.read_parquet(output_parquet_file)\n",
" transformed_df.to_csv(output_csv_file, index=False)\n",
" print(f\"Transformed CSV file saved at: {output_csv_file}\")\n",
"\n",
" # Display selected columns in tabular format\n",
" if 'Customer Feedback' in transformed_df.columns and 'hap_score' in transformed_df.columns:\n",
" display_df = transformed_df[['Customer Feedback', 'hap_score']]\n",
" print(\"\\nSelected Columns (Customer Feedback and HAP Score):\")\n",
" from IPython.display import display # Ensure pretty display in Jupyter\n",
" display(display_df.head(10)) # Display the first 10 rows\n",
" else:\n",
" print(\"The required columns ('Customer Feedback' and 'hap_score') are not in the transformed data.\")\n",
"\n",
" except Exception as e:\n",
" print(f\"Error processing files: {e}\")\n",
"else:\n",
" print(f\"No Parquet files found in the output folder: {output_folder}\")\n"
]
},
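{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional follow-on, the sketch below flags rows whose `hap_score` exceeds a threshold for manual review. The 0.5 cutoff is an arbitrary illustration rather than a recommended value, and the cell assumes the previous cell ran successfully so that `transformed_df` exists."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: flag feedback whose HAP score exceeds a threshold for manual review.\n",
"# Assumes transformed_df was created above; the 0.5 threshold is arbitrary.\n",
"threshold = 0.5\n",
"flagged = transformed_df[transformed_df[\"hap_score\"] > threshold]\n",
"print(f\"{len(flagged)} of {len(transformed_df)} rows exceed hap_score {threshold}\")\n",
"flagged[[\"Customer Feedback\", \"hap_score\"]]"
]
}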
],
"metadata": {
"kernelspec": {
"display_name": "dataprepkit",
"language": "python",
"name": "data-prep-kit"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}