Merge pull request #840 from AishaDarga/hap_score
touma-I authored Dec 7, 2024
2 parents 5fbc6af + 06a2936 commit 2e2e5d4
Showing 2 changed files with 456 additions and 0 deletions.
390 changes: 390 additions & 0 deletions examples/notebooks/hap/generate_hap_score_csv.ipynb
@@ -0,0 +1,390 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"HAP Transform Example Notebook\n",
"=====================================\n",
"\n",
"This notebook processes a CSV file containing text data to analyze for Hate, Abuse, and Profanity (HAP) scores.\n",
"It converts the CSV file into Parquet format, uses the `hap_local_python.py` script to calculate HAP scores, \n",
"and generates outputs for further analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### Overview\n",
"This notebook demonstrates the use of the HAP transformation to annotate documents with a `hap_score`, \n",
"indicating the likelihood of Hate, Abuse, or Profanity in the text.\n",
"\n",
"### Workflow\n",
"The HAP process consists of:\n",
"1. **Sentence Splitting**: Documents are split into sentences using NLTK.\n",
"2. **HAP Annotation**: Each sentence is scored between 0 and 1 (1 = high HAP, 0 = no HAP).\n",
"3. **Aggregation**: The document's final HAP score is the maximum score among all sentences.\n",
"\n",
"\n",
"### Configuration\n",
"- **Model Name**: IBM Granite Guardian (`ibm-granite/granite-guardian-hap-38m` by default).\n",
"- **Document Text Column** (`--doc_text_column`): Specify the input column containing document text to generate the hap_score against. Defaults to `contents`.\n",
"- **Annotation Column** (`--annotation_column`): Specify the output column for HAP scores. Defaults to `hap_score`.\n",
"\n",
"\n",
"### Steps in This Notebook\n",
"1. Define paths and import libraries.\n",
"2. Convert CSV input to Parquet.\n",
"3. Run the HAP transformation script.\n",
"4. View and analyze the results.\n"
]
},
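{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal, self-contained sketch of the split-score-aggregate logic described above, loading the default Granite Guardian checkpoint directly with Hugging Face `transformers`. It is for illustration only: the direct model usage, the softmax readout, and the assumption that label index 1 is the HAP class are not taken from the transform's actual implementation, which is invoked in Step 4."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of the HAP scoring logic; the real work is done by the\n",
"# HAP transform invoked in Step 4. Direct model usage here is an assumption.\n",
"import nltk\n",
"import torch\n",
"from transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
"\n",
"nltk.download(\"punkt\", quiet=True)  # newer NLTK releases may also need \"punkt_tab\"\n",
"\n",
"model_name = \"ibm-granite/granite-guardian-hap-38m\"  # the transform's default model\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"model = AutoModelForSequenceClassification.from_pretrained(model_name)\n",
"\n",
"def doc_hap_score(text: str) -> float:\n",
"    # 1. Sentence splitting with NLTK\n",
"    sentences = nltk.sent_tokenize(text)\n",
"    # 2. Score each sentence between 0 and 1\n",
"    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors=\"pt\")\n",
"    with torch.no_grad():\n",
"        logits = model(**inputs).logits\n",
"    scores = torch.softmax(logits, dim=-1)[:, 1]  # HAP class assumed at label index 1\n",
"    # 3. Aggregation: the document score is the maximum sentence score\n",
"    return float(scores.max())\n",
"\n",
"print(doc_hap_score(\"The service was quick. The agent was very rude.\"))"
]
},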
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
"```\n",
"make venv \n",
"source venv/bin/activate \n",
"pip install jupyterlab\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install 'data-prep-toolkit[ray]==0.2.2'\n",
"! pip install 'data-prep-toolkit-transforms[all]==0.2.2'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import necessary libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import pandas as pd\n",
"import subprocess\n",
"\n",
"from data_processing.runtime.pure_python import PythonTransformLauncher\n",
"from data_processing.utils import ParamsUtils\n",
"from hap_transform_python import HAPPythonTransformConfiguration\n",
"from hap_transform import HAPTransform"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1: Setup runtime parameters\n",
"\n",
"- input-folder: Path to the input data to be used by the transform.\n",
"- output-folder: Path where the output file with HAP scores will be saved.\n",
"- doc_text_column: The column containing the text for analysis (For ex.: `Customer Feedback`).\n",
"- annotation_column: The column where HAP scores will be saved (default: `hap_score`).\n",
"\n",
"**Customization**: \n",
"- Ensure the column containing the text matches the `doc_text_column` parameter.\n",
"- If your text column has a different name, update the value of `--doc_text_column` accordingly.\n",
"- You can adjust other parameters like `--batch_size` and `--max_length` if needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# create parameters\n",
"input_csv = './input-csv'\n",
"input_folder = './input-parquet'\n",
"output_folder = './output'\n",
"output_parquet_file = os.path.join(output_folder, \"customer-feedback.parquet\")\n",
"output_csv_file = os.path.join(output_folder, \"customer-feedback.csv\")\n",
"\n",
"local_conf = {\n",
" \"input_folder\": input_folder,\n",
" \"output_folder\": output_folder,\n",
"}\n",
"\n",
"print(input_folder)\n",
"code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
"params = {\n",
" # Data access. Only required parameters are specified\n",
" \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
" # execution info\n",
" \"runtime_pipeline_id\": \"pipeline_id\",\n",
" \"runtime_job_id\": \"job_id\",\n",
" \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n",
" # hap params\n",
" \"doc_text_column\": \"Customer Feedback\",\n",
"}"
]
},
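{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how this dictionary reaches the launcher, you can print the result of `ParamsUtils.dict_to_req` (the same call Step 4 uses to populate `sys.argv`). This is purely an inspection aid; the exact argument format is whatever the installed `ParamsUtils` produces."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Diagnostic only: show the command-line style arguments generated from params.\n",
"# Step 4 assigns this same list to sys.argv before launching the transform.\n",
"print(ParamsUtils.dict_to_req(d=params))"
]
},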
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Setup Input Data\n",
"\n",
"- Place your CSV file in the `input_csv`.\n",
"- Ensure the column containing the text matches the `doc_text_column` parameter.\n",
"- If your text column has a different name, update the `doc_text_column` parameter in later cells."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Clear the existing input-parquet folder\n",
"if os.path.exists(input_folder):\n",
" for file_name in os.listdir(input_folder):\n",
" file_path = os.path.join(input_folder, file_name)\n",
" try:\n",
" os.remove(file_path)\n",
" print(f\"Deleted file: {file_path}\")\n",
" except Exception as e:\n",
" print(f\"Failed to delete {file_path}: {e}\")\n",
"else:\n",
" os.makedirs(input_csv, exist_ok=True) # Updated line\n",
" os.makedirs(input_folder, exist_ok=True) # Updated line\n",
" print(f\"Created folder: {input_csv}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"csv_files = [f for f in os.listdir(input_csv) if f.endswith(\".csv\")]\n",
"\n",
"if not csv_files:\n",
" print(f\"No CSV files found in the input folder: {input_csv}\")\n",
" print(\"Please place a CSV file in the input folder and rerun this script.\")\n",
"else:\n",
" # Pick the first CSV file in the folder\n",
" csv_file_path = os.path.join(input_csv, csv_files[0])\n",
" print(f\"Using CSV file: {csv_file_path}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: Convert CSV to Parquet\n",
"Convert the selected CSV file to Parquet format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"parquet_file_path = os.path.join(input_folder, \"customer-feedback.parquet\")\n",
"df = pd.read_csv(csv_file_path)\n",
"df.to_parquet(parquet_file_path, index=False)\n",
"print(f\"CSV file converted to Parquet format at: {parquet_file_path}\")"
]
},
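{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, read the Parquet file back and confirm that the column named in `doc_text_column` (here `Customer Feedback`) is present before invoking the transform."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: the transform needs doc_text_column in the input.\n",
"check_df = pd.read_parquet(parquet_file_path)\n",
"print(f\"Rows: {len(check_df)}, Columns: {list(check_df.columns)}\")\n",
"assert \"Customer Feedback\" in check_df.columns, \"doc_text_column not found in input\""
]
},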
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4: Invoke the transform\n",
"Use python runtime to invoke the transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"sys.argv = ParamsUtils.dict_to_req(d=params)\n",
"launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())\n",
"launcher.launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#The specified folder will include the transformed parquet files.\n",
"\n",
"import glob\n",
"glob.glob(\"python/output/*\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 5: Display Output in a Readable Format\n",
"\n",
"This step checks for any existing CSV files in the output folder and removes them before generating new ones. The following actions are performed:\n",
"\n",
"1. **Listing Output Files**: The script lists all files in the output folder.\n",
"2. **Check for Parquet Files**: It identifies `.parquet` files in the output folder.\n",
"3. **Remove Old CSV Files**: If any previous output files (`hap_complete_output.csv` or `hap_filtered_output.csv`) exist, they are deleted.\n",
"4. **Read Parquet File**: The Parquet file is read into a DataFrame.\n",
"5. **Filter Data**: The relevant columns, `doc_text_column` (from the environment variable) and `hap_score_column`, are selected from the DataFrame.\n",
"6. **CSV Output**: Convert the parquet output to CSV \n",
"7. **Display Output**: Display Output in the notebook for a quick reference \n",
"\n",
"Example Output:\n",
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"0\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: left;\">\n",
" <th></th>\n",
" <th>Customer Feedback</th>\n",
" <th>hap_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Rating: 4 Comments: \"Service was prompt, but ...</td>\n",
" <td>0.000195</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Rating: 5 Comments: \"Great help from Peter! H...</td>\n",
" <td>0.000153</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Rating: 3 Comments: \"The service was quick, b...</td>\n",
" <td>0.000169</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Rating: 5 Comments: \"Excellent service and ad...</td>\n",
" <td>0.000158</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Locate transformed Parquet files in the output folder\n",
"output_parquet_files = [f for f in os.listdir(output_folder) if f.endswith(\".parquet\")]\n",
"\n",
"if output_parquet_files:\n",
" # Clear existing CSV files in the output folder\n",
" for file_name in os.listdir(output_folder):\n",
" if file_name.endswith(\".csv\"):\n",
" file_path = os.path.join(output_folder, file_name)\n",
" try:\n",
" os.remove(file_path)\n",
" print(f\"Deleted old CSV file: {file_path}\")\n",
" except Exception as e:\n",
" print(f\"Failed to delete {file_path}: {e}\")\n",
"\n",
" for output_parquet in output_parquet_files:\n",
" original_parquet_path = os.path.join(output_folder, output_parquet)\n",
" try:\n",
" # Rename the Parquet file\n",
" os.rename(original_parquet_path, output_parquet_file)\n",
" print(f\"Renamed Parquet file to: {output_parquet_file}\")\n",
"\n",
" # Convert the renamed Parquet file to CSV\n",
" transformed_df = pd.read_parquet(output_parquet_file)\n",
" transformed_df.to_csv(output_csv_file, index=False)\n",
" print(f\"Transformed CSV file saved at: {output_csv_file}\")\n",
"\n",
" # Display selected columns in tabular format\n",
" if 'Customer Feedback' in transformed_df.columns and 'hap_score' in transformed_df.columns:\n",
" display_df = transformed_df[['Customer Feedback', 'hap_score']]\n",
" print(\"\\nSelected Columns (Customer Feedback and HAP Score):\")\n",
" from IPython.display import display # Ensure pretty display in Jupyter\n",
" display(display_df.head(10)) # Display the first 10 rows\n",
" else:\n",
" print(\"The required columns ('Customer Feedback' and 'hap_score') are not in the transformed data.\")\n",
"\n",
" except Exception as e:\n",
" print(f\"Error processing files: {e}\")\n",
"else:\n",
" print(f\"No Parquet files found in the output folder: {output_folder}\")\n"
]
},
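{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional follow-on, the sketch below flags rows whose `hap_score` exceeds a threshold for manual review. The 0.5 cutoff is an arbitrary illustration rather than a recommended value, and the cell assumes the previous cell ran successfully so that `transformed_df` exists."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: flag feedback whose HAP score exceeds a threshold for manual review.\n",
"# Assumes transformed_df was created above; the 0.5 threshold is arbitrary.\n",
"threshold = 0.5\n",
"flagged = transformed_df[transformed_df[\"hap_score\"] > threshold]\n",
"print(f\"{len(flagged)} of {len(transformed_df)} rows exceed hap_score {threshold}\")\n",
"flagged[[\"Customer Feedback\", \"hap_score\"]]"
]
}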
],
"metadata": {
"kernelspec": {
"display_name": "dataprepkit",
"language": "python",
"name": "data-prep-kit"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}