Merge branch 'dev' into transforms-0.2.3.dev1
touma-I committed Dec 3, 2024
2 parents 8c70ce3 + 8987261 commit b2625d0
Showing 3 changed files with 286 additions and 22 deletions.
34 changes: 29 additions & 5 deletions resources.md
@@ -1,3 +1,8 @@
# New Features & Enhancements

- Support for Docling 2.0 added to DPK in the [pdf2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/pdf2parquet/python) transform. The update allows DPK users to ingest other types of documents, e.g. MS Word, MS PowerPoint, images, Markdown, AsciiDoc, etc.
- Released the [Web2parquet](https://github.com/IBM/data-prep-kit/tree/dev/transforms/universal/web2parquet) transform for crawling the web.

# Data Prep Kit Resources

## 📄 Papers
@@ -7,24 +12,43 @@
3. [Scaling Granite Code Models to 128K Context](https://arxiv.org/abs/2407.13739)


## 🎤 Talks
## 🎤 External Events and Showcase

1. **"Building Successful LLM Apps: The Power of high quality data"** - [Video](https://www.youtube.com/watch?v=u_2uiZBBVIE) | [Slides](https://www.slideshare.net/slideshow/data_prep_techniques_challenges_methods-pdf-a190/271527890)
2. **"Hands on session for fine tuning LLMs"** - [Video](https://www.youtube.com/watch?v=VEHIA3E64DM)
3. **"Build your own data preparation module using data-prep-kit"** - [Video](https://www.youtube.com/watch?v=0WUMG6HIgMg)
4. **"Data Prep Kit: A Comprehensive Cloud-Native Toolkit for Scalable Data Preparation in GenAI App"** - [Video](https://www.youtube.com/watch?v=WJ147TGULwo) | [Slides](https://ossaidevjapan24.sched.com/event/1jKBm)
5. **"RAG with Data Prep Kit" Workshop** @ Mountain View, CA, USA ** - [info](https://github.com/sujee/data-prep-kit-examples/blob/main/events/2024-09-21__RAG-workshop-data-riders.md)
6. **Tech Educator summit** [IBM CSR Event](https://www.linkedin.com/posts/aanchalaggarwal_github-ibmdata-prep-kit-open-source-project-activity-7254062098295472128-OA_x?utm_source=share&utm_medium=member_desktop)
7. **Talk and Hands on session** at [MIT Bangalore](https://www.linkedin.com/posts/saptha-surendran-71a4a0ab_ibmresearch-dataprepkit-llms-activity-7261987741087801346-h0no?utm_source=share&utm_medium=member_desktop)
8. **PyData NYC 2024** - [90-minute tutorial](https://nyc2024.pydata.org/cfp/talk/AWLTZP/)
9. **Open Source AI** [Demo Night](https://lu.ma/oss-ai?tk=A8BgIt)
10. [**Data Exchange Podcast with Ben Lorica**](https://thedataexchange.media/ibm-data-prep-kit/)
11. Unstructured Data Meetup - SF, NYC, Silicon Valley
12. IBM TechXchange Las Vegas
13. Open Source [**RAG Pipeline workshop**](https://www.linkedin.com/posts/sujeemaniyam_dataprepkit-workshop-llm-activity-7256176802383986688-2UKc?utm_source=share&utm_medium=member_desktop) with Data Prep Kit at TechEquity's AI Summit in Silicon Valley
14. **Data Science Dojo Meetup** - [video](https://datasciencedojo.com/tutorial/data-preparation-toolkit/)
15. [**DPK tutorial and hands on session at IIIT Delhi**](https://www.linkedin.com/posts/cai-iiitd-97a6a4232_datascience-datapipelines-machinelearning-activity-7263121565125349376-FG8E?utm_source=share&utm_medium=member_desktop)


## Example Code
Find example code in the README section of each transform, and sample Jupyter notebooks for getting started [**here**](examples/notebooks)

## Blogs / Tutorials

- [**IBM Developer Blog**](https://developer.ibm.com/blogs/awb-unleash-potential-llms-data-prep-kit/)
- [**Introductory Blog on DPK**](https://www.linkedin.com/pulse/unleashing-potential-large-language-models-through-data-aanchal-goyal-fgtff)
- [**DPK Header Cleanser Module Blog by external contributor**](https://www.linkedin.com/pulse/enhancing-data-quality-developing-header-cleansing-tool-kalathiya-i1ohc/?trackingId=6iAeBkBBRrOLijg3LTzIGA%3D%3D)


## Workshops
# Relevant online communities

- **2024-09-21: "RAG with Data Prep Kit" Workshop** @ Mountain View, CA, USA - [info](https://github.com/sujee/data-prep-kit-examples/blob/main/events/2024-09-21__RAG-workshop-data-riders.md)
- [**Data Prep Kit Discord Channel**](https://discord.com/channels/1276554812359442504/1303454647427661866)
- [**DPK is now listed in GitHub's Awesome-LLM under the LLM Data section**](https://github.com/Hannibal046/Awesome-LLM)
- [**DPK is now up for access via IBM Skills Build Download**](https://academic.ibm.com/a2mt/downloads/artificial_intelligence#/)
- [**DPK added to the Application Hub of “AI Sustainability Catalog”**](https://enterprise-neurosystem.github.io/Sustainability-Catalog/)

## Discord
## We Want Your Feedback!
Feel free to contribute to existing discussions or create a new one to share your [feedback](https://github.com/IBM/data-prep-kit/discussions).

- [**Data Prep Kit Discord Channel**](https://discord.com/channels/1276554812359442504/1286046139921207476)

57 changes: 40 additions & 17 deletions transforms/universal/hap/python/README.md
@@ -1,45 +1,53 @@
# Hate, Abuse, and Profanity (HAP) Annotation
Please see the set of [transform project conventions](https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/README.md) for details on general project conventions, transform configuration, testing and IDE set up.

## Prerequisite
## Contributor
- Yang Zhao ([email protected])

## Description
### Prerequisite
This transform requires [NLTK](https://www.nltk.org/); please refer to `requirements.txt` for the full list of dependencies.
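
The tokenizer models NLTK uses for sentence splitting can be fetched once up front; a minimal sketch (`punkt_tab` is the resource the accompanying notebook downloads):

```python
import nltk

# Fetch the sentence tokenizer models used by the hap transform.
nltk.download("punkt_tab")
```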

## Summary
### Overview
The hap transform maps a non-empty input table to an output table with an added `hap_score` column. Each row in the table represents a document, and the hap transform performs the following three steps to calculate the hap score for each document:

* Sentence splitting: we use NLTK to split the document into sentence pieces.
* hap annotation: each sentence is assigned a hap score between 0 and 1, where 1 represents hap and 0 represents non-hap.
* Aggregation: the document hap score is determined by selecting the maximum hap score among its sentences.
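
These steps can be illustrated with a small, self-contained sketch. This is not DPK's actual implementation; it assumes the NLTK prerequisite above is satisfied and that the model reports its positive (hap) class as `LABEL_1`:

```python
import nltk
from transformers import pipeline

# Hugging Face pipeline around the default HAP model.
classifier = pipeline(
    "text-classification",
    model="ibm-granite/granite-guardian-hap-38m",
)

def document_hap_score(text: str) -> float:
    # 1. Sentence splitting with NLTK.
    sentences = nltk.sent_tokenize(text)
    if not sentences:
        return 0.0
    # 2. Per-sentence hap score in [0, 1] (assumes LABEL_1 = hap).
    results = classifier(sentences, truncation=True, max_length=512)
    scores = [
        r["score"] if r["label"] == "LABEL_1" else 1.0 - r["score"]
        for r in results
    ]
    # 3. Aggregation: the document score is the max sentence score.
    return max(scores)
```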


## Configuration and command line Options
The set of dictionary keys holding [HAPTransformConfiguration](src/hap_transform.py)
configuration values is as follows:

* --model_name_or_path - specify the HAP model, which should be compatible with HuggingFace's AutoModelForSequenceClassification. Defaults to IBM's open-source toxicity classifier `ibm-granite/granite-guardian-hap-38m`.
* --batch_size - modify it based on the infrastructure capacity. Defaults to `128`.
* --max_length - the maximum length for the tokenizer. Defaults to `512`.
* --doc_text_column - the column name containing the document text in the input .parquet file. Defaults to `contents`.
* --annotation_column - the column name containing hap (toxicity) score in the output .parquet file. Defaults to `hap_score`.


## input format
### input format
The input is in .parquet format and contains the following columns:

| doc_id | contents |
|:------:|:------:|
| 1 | GSC is very much a little Swiss Army knife for... |
| 2 | Here are only a few examples. And no, I'm not ... |

## output format

### output format
The output is in .parquet format and includes one additional column beyond those in the input:

| doc_id | contents | hap_score |
|:------:|:------:|:-------------:|
| 1 | GSC is very much a little Swiss Army knife for... | 0.002463 |
| 2 | Here are only a few examples. And no, I'm not ... | 0.989713 |

## How to run
## Configuration
The set of dictionary keys holding [HAPTransformConfiguration](src/hap_transform.py)
configuration values is as follows:


* --model_name_or_path - specify the HAP model, which should be compatible with HuggingFace's AutoModelForSequenceClassification. Defaults to IBM's open-source toxicity classifier `ibm-granite/granite-guardian-hap-38m`.
* --batch_size - modify it based on the infrastructure capacity. Defaults to `128`.
* --max_length - the maximum length for the tokenizer. Defaults to `512`.
* --doc_text_column - the column name containing the document text in the input .parquet file. Defaults to `contents`.
* --annotation_column - the column name containing hap (toxicity) score in the output .parquet file. Defaults to `hap_score`.
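
When driving the transform from Python instead of the command line, the same keys are passed without the `--` prefix. A sketch mirroring the accompanying notebook's parameters:

```python
# Runtime parameters for the HAP transform (Python-dict form of the
# command line options above; values shown are the documented defaults).
hap_params = {
    "model_name_or_path": "ibm-granite/granite-guardian-hap-38m",
    "annotation_column": "hap_score",
    "doc_text_column": "contents",
    "max_length": 512,
    "batch_size": 128,
}
```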




## Usage
Place your input Parquet file in the `test-data/input/` directory. A sample file, `test1.parquet`, is available in this directory. Once done, run the script:

@@ -48,6 +56,20 @@

```python
python hap_local_python.py
```

You will obtain the output file `test1.parquet` in the output directory.
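
The annotated output can then be inspected directly; a sketch assuming `pandas` is installed and the output path from the run above:

```python
import pandas as pd

df = pd.read_parquet("output/test1.parquet")
# 0.5 is an illustrative threshold, not a recommendation of this transform.
print(df[df["hap_score"] > 0.5][["doc_id", "hap_score"]])
```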

### Code example
A worked example is available in the accompanying [notebook](./hap_python.ipynb).

### Transforming data using the transform image
To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.

## Testing

Currently we have:
- [hap test](test/test_hap.py)
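
Assuming `pytest` is available in the active virtual environment, the test can be run from this directory:

```python
pytest test/test_hap.py
```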


## Throughput
The table below shows the throughput (tokens per second) of the HAP transform module, which primarily includes sentence splitting, HAP annotation, and HAP score aggregation. We compare two models:

@@ -62,6 +84,7 @@ We processed 6,000 documents (12 MB in Parquet file size) using the HAP transform
| granite-guardian-hap-125m | 1.14 k |



### Credits
The HAP transform is jointly developed by IBM Research - Tokyo and IBM Research - Yorktown.


217 changes: 217 additions & 0 deletions transforms/universal/hap/python/hap_python.ipynb
@@ -0,0 +1,217 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "cefa9cf6-e043-4b75-b416-a0b26c8cb3ad",
"metadata": {},
"source": [
"**** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
"```\n",
" make venv \n",
" source venv/bin/activate \n",
" pip install jupyterlab\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "4a84e965-feeb-424d-9263-9f127e53a1aa",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"## This is here as a reference only\n",
"# Users and application developers must use the right tag for the latest from pypi\n",
"%pip install data-prep-toolkit\n",
"%pip install data-prep-toolkit-transforms==0.2.2.dev3"
]
},
{
"cell_type": "markdown",
"id": "1d695832-16bc-48d3-a9c3-6ce650ae4a5c",
"metadata": {},
"source": [
"**** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration for values are as follows:\n",
" - model_name_or_path - specify the HAP model, which should be compatible with HuggingFace's AutoModelForSequenceClassification. Defaults to IBM's open-source toxicity classifier ibm-granite/granite-guardian-hap-38m.\n",
" - annotation_column - the column name containing hap (toxicity) score in the output .parquet file. Defaults to hap_score.\n",
" - doc_text_column - the column name containing the document text in the input .parquet file. Defaults to contents.\n",
" - batch_size - modify it based on the infrastructure capacity. Defaults to 128.\n",
" - max_length - the maximum length for the tokenizer. Defaults to 512."
]
},
{
"cell_type": "markdown",
"id": "3f9dbf94-2db4-492d-bbcb-53ac3948c256",
"metadata": {},
"source": [
"***** Import required classes and modules"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "38aebf49-9460-4951-bb04-7045dec28690",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt_tab to /Users/ian/nltk_data...\n",
"[nltk_data] Package punkt_tab is already up-to-date!\n"
]
}
],
"source": [
"import ast\n",
"import os\n",
"import sys\n",
"\n",
"from data_processing.runtime.pure_python import PythonTransformLauncher\n",
"from data_processing.utils import ParamsUtils\n",
"from hap_transform_python import HAPPythonTransformConfiguration"
]
},
{
"cell_type": "markdown",
"id": "f443108f-40e4-40e5-a052-e8a7f4fbccdf",
"metadata": {},
"source": [
"***** Setup runtime parameters for this transform"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6a8ec5e4-1f52-4c61-9c9e-4618f9034b80",
"metadata": {},
"outputs": [],
"source": [
"# create parameters\n",
"__file__ = os.getcwd()\n",
"input_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), \"../test-data/input\"))\n",
"output_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), \"../output\"))\n",
"local_conf = {\n",
" \"input_folder\": input_folder,\n",
" \"output_folder\": output_folder,\n",
"}\n",
"code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
"\n",
"params = {\n",
" \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
" \"runtime_pipeline_id\": \"pipeline_id\",\n",
" \"runtime_job_id\": \"job_id\",\n",
" \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n",
"}\n",
"\n",
"\n",
"hap_params = {\n",
" \"model_name_or_path\": 'ibm-granite/granite-guardian-hap-38m',\n",
" \"annotation_column\": \"hap_score\",\n",
" \"doc_text_column\": \"contents\",\n",
" \"inference_engine\": \"CPU\",\n",
" \"max_length\": 512,\n",
" \"batch_size\": 128,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "d70abda8-3d66-4328-99ce-4075646a7756",
"metadata": {},
"source": [
"***** Use python runtime to invoke the transform"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "94e908e2-1891-4dc7-9f85-85bbf8d44c5e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"11:29:11 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} \n",
"11:29:11 INFO - pipeline id pipeline_id\n",
"11:29:11 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n",
"11:29:11 INFO - data factory data_ is using local data access: input_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/test-data/input output_folder - /Users/ian/Desktop/data-prep-kit/transforms/universal/hap/output\n",
"11:29:11 INFO - data factory data_ max_files -1, n_sample -1\n",
"11:29:11 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
"11:29:11 INFO - orchestrator hap started at 2024-12-03 11:29:11\n",
"11:29:11 ERROR - No input files to process - exiting\n",
"11:29:11 INFO - Completed execution in 0.0 min, execution result 0\n"
]
}
],
"source": [
"%%capture\n",
"sys.argv = ParamsUtils.dict_to_req(d=params | hap_params)\n",
"launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())\n",
"launcher.launch()"
]
},
{
"cell_type": "markdown",
"id": "0bd4ad5c-a1d9-4ea2-abb7-e43571095392",
"metadata": {},
"source": [
"**** The specified folder will include the transformed parquet files."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "f21d5d9b-562d-4530-8cea-2de5b63eb1dc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['../output/metadata.json', '../output/test1.parquet']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the outputs will be located in the following folders\n",
"import glob\n",
"glob.glob(\"../output/*\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2cd3367a-205f-4d33-83fb-106e32173bc0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
