HAP transform: Update README.md and add sample notebook #821
Merged
Commits (7):
- 569b080 Update README.md (ian-cho)
- ddfa7b8 Add files via upload (ian-cho)
- dea69e2 Update README.md (ian-cho)
- 8e3f3bd Add files via upload (ian-cho)
- 083c230 Update README.md (ian-cho)
- 2a374d1 fix layout for commands in first cell (touma-I)
- 7102f05 Merge branch 'dev' into ian-cho-patch-1 (touma-I)
File: README.md
# Hate, Abuse, and Profanity (HAP) Annotation
Please see the set of [transform project conventions](https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/README.md) for details on general project conventions, transform configuration, testing and IDE set up.

## Contributor
- Yang Zhao ([email protected])

## Description
### Prerequisite
This transform requires [NLTK](https://www.nltk.org/); see `requirements.txt` for the full set of dependencies.
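For reference, the NLTK sentence-splitting data can be fetched ahead of time (a minimal sketch; the sample notebook added in this PR triggers the same `punkt_tab` download on first run):

```python
# One-time download of the NLTK sentence-splitting data used for
# sentence tokenization; the sample notebook downloads the same package.
import nltk

nltk.download("punkt_tab")
```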
### Overview
The HAP transform maps a non-empty input table to an output table with an added `hap_score` column. Each row in the table represents a document, and the HAP transform performs the following three steps to calculate the HAP score for each document:

* Sentence splitting: we use NLTK to split the document into sentence pieces.
* HAP annotation: each sentence is assigned a HAP score between 0 and 1, where 1 represents HAP and 0 represents non-HAP.
* Aggregation: the document HAP score is determined by selecting the maximum HAP score among its sentences, as sketched below.
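A minimal sketch of these three steps, assuming the `nltk`, `torch`, and Hugging Face `transformers` packages; the function below is illustrative, not the transform's internal API, and it assumes label index 1 of the classifier is the HAP class:

```python
# Illustrative sketch of the three HAP-scoring steps described above.
import nltk
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nltk.download("punkt_tab")  # NLTK sentence-splitting data

model_name = "ibm-granite/granite-guardian-hap-38m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def hap_score(document: str) -> float:
    # 1) Sentence splitting: break the document into sentence pieces.
    sentences = nltk.sent_tokenize(document)
    # 2) HAP annotation: score each sentence between 0 (non-HAP) and 1 (HAP).
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    # Assumption: label index 1 is the HAP class of this classifier.
    scores = torch.softmax(logits, dim=1)[:, 1]
    # 3) Aggregation: the document score is the max over its sentences.
    return float(scores.max())
```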
### Input format
The input is in .parquet format and contains the following columns:

| doc_id | contents |
|:------:|:--------:|
| 1 | GSC is very much a little Swiss Army knife for... |
| 2 | Here are only a few examples. And no, I'm not ... |
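For example, a table in this format can be written with pandas (a sketch, assuming pandas with parquet support; the path and texts are illustrative):

```python
# Sketch: write a small input table in the expected .parquet format.
import pandas as pd

df = pd.DataFrame(
    {
        "doc_id": [1, 2],
        "contents": [
            "GSC is very much a little Swiss Army knife for...",
            "Here are only a few examples. And no, I'm not ...",
        ],
    }
)
df.to_parquet("test-data/input/test1.parquet")  # illustrative path
```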
### Output format
The output is in .parquet format and includes the additional `hap_score` column alongside the input columns:

| doc_id | contents | hap_score |
|:------:|:--------:|:---------:|
| 1 | GSC is very much a little Swiss Army knife for... | 0.002463 |
| 2 | Here are only a few examples. And no, I'm not ... | 0.989713 |
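The annotated output can then be inspected the same way (a sketch, assuming pandas with parquet support; the path is illustrative):

```python
# Sketch: load the annotated output and inspect the added hap_score column.
import pandas as pd

out = pd.read_parquet("output/test1.parquet")  # illustrative path
print(out[["doc_id", "contents", "hap_score"]])
```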
## Configuration
The set of dictionary keys holding the [HAPTransformConfiguration](src/hap_transform.py)
configuration values is as follows:

* --model_name_or_path - specify the HAP model, which should be compatible with HuggingFace's AutoModelForSequenceClassification. Defaults to IBM's open-source toxicity classifier `ibm-granite/granite-guardian-hap-38m`.
* --batch_size - modify it based on the infrastructure capacity. Defaults to `128`.
* --max_length - the maximum length for the tokenizer. Defaults to `512`.
* --doc_text_column - the column name containing the document text in the input .parquet file. Defaults to `contents`.
* --annotation_column - the column name containing the HAP (toxicity) score in the output .parquet file. Defaults to `hap_score`.
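These options can also be supplied programmatically. The sketch below mirrors the sample notebook added in this PR; folder paths are illustrative, and some optional runtime keys used in the notebook are omitted:

```python
# Sketch, mirroring the sample notebook: configure and launch the HAP
# transform with the configuration keys listed above. Paths are illustrative.
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from hap_transform_python import HAPPythonTransformConfiguration

local_conf = {"input_folder": "test-data/input", "output_folder": "output"}
params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # HAP options from the list above
    "model_name_or_path": "ibm-granite/granite-guardian-hap-38m",
    "annotation_column": "hap_score",
    "doc_text_column": "contents",
    "max_length": 512,
    "batch_size": 128,
}
# Simulate the command-line arguments and launch the transform.
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())
launcher.launch()
```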
## Usage
Place your input Parquet file in the `test-data/input/` directory. A sample file, `test1.parquet`, is available in this directory. Once done, run the script:

```python
python hap_local_python.py
```

You will obtain the output file `test1.parquet` in the output directory.
### Code example
[notebook](../hap_python.ipynb)

### Transforming data using the transform image
To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.

## Testing
Currently we have:
- [hap test](transforms/universal/hap/python/test/test_hap.py)
## Throughput
The table below shows the throughput (tokens per second) of the HAP transform module, which primarily includes sentence splitting, HAP annotation, and HAP score aggregation. We herein compare two models:

We processed 6,000 documents (12 MB in Parquet file size) using the HAP transform […]

| granite-guardian-hap-125m | 1.14 k |

### Credits
The HAP transform is jointly developed by IBM Research - Tokyo and Yorktown.
File: hap_python.ipynb (new file)
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "38aebf49-9460-4951-bb04-7045dec28690",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt_tab to /Users/ian/nltk_data...\n",
      "[nltk_data] Package punkt_tab is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "# import necessary packages\n",
    "import ast\n",
    "import os\n",
    "import sys\n",
    "\n",
    "from data_processing.runtime.pure_python import PythonTransformLauncher\n",
    "from data_processing.utils import ParamsUtils\n",
    "from hap_transform_python import HAPPythonTransformConfiguration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6a8ec5e4-1f52-4c61-9c9e-4618f9034b80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create parameters\n",
    "__file__ = os.getcwd()\n",
    "input_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), \"../test-data/input\"))\n",
    "output_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), \"../output\"))\n",
    "local_conf = {\n",
    "    \"input_folder\": input_folder,\n",
    "    \"output_folder\": output_folder,\n",
    "}\n",
    "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
    "\n",
    "params = {\n",
    "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
    "    \"runtime_pipeline_id\": \"pipeline_id\",\n",
    "    \"runtime_job_id\": \"job_id\",\n",
    "    \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n",
    "}\n",
    "\n",
    "\n",
    "hap_params = {\n",
    "    \"model_name_or_path\": 'ibm-granite/granite-guardian-hap-38m',\n",
    "    \"annotation_column\": \"hap_score\",\n",
    "    \"doc_text_column\": \"contents\",\n",
    "    \"inference_engine\": \"CPU\",\n",
    "    \"max_length\": 512,\n",
    "    \"batch_size\": 128,\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "94e908e2-1891-4dc7-9f85-85bbf8d44c5e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "22:40:12 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} \n",
      "22:40:12 INFO - pipeline id pipeline_id\n",
      "22:40:12 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n",
      "22:40:12 INFO - data factory data_ is using local data access: input_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/test-data/input output_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/output\n",
      "22:40:12 INFO - data factory data_ max_files -1, n_sample -1\n",
      "22:40:12 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
      "22:40:12 INFO - orchestrator hap started at 2024-12-02 22:40:12\n",
      "22:40:12 ERROR - No input files to process - exiting\n",
      "22:40:12 INFO - Completed execution in 0.0 min, execution result 0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Set the simulated command line args\n",
    "sys.argv = ParamsUtils.dict_to_req(d=params | hap_params)\n",
    "# create launcher\n",
    "launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())\n",
    "# Launch to process the input\n",
    "launcher.launch()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f21d5d9b-562d-4530-8cea-2de5b63eb1dc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['../output/metadata.json', '../output/test1.parquet']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# the outputs will be located in the following folders\n",
    "import glob\n",
    "glob.glob(\"../output/*\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2cd3367a-205f-4d33-83fb-106e32173bc0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
Review comment: The link to the notebook is broken. The correct link is `./hap_python.ipynb`, because the notebook is in the `python` directory.