IBM · touma-I · Nov 22, 2024 · Nov 13, 2024 · Nov 13, 2024 · Nov 13, 2024
diff --git a/transforms/language/doc_chunk/doc_chunk.ipynb b/transforms/language/doc_chunk/doc_chunk.ipynb
@@ -0,0 +1,192 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "afd55886-5f5b-4794-838e-ef8179fb0394",
+   "metadata": {},
+   "source": [
+    "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
+    "```\n",
+    "make venv\n",
+    "source venv/bin/activate && pip install jupyterlab\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "## This is here as a reference only\n",
+    "# Users and application developers must use the right tag for the latest from pypi\n",
+    "#!pip install data-prep-toolkit\n",
+    "#!pip install data-prep-toolkit-transforms\n",
+    "#!pip install data-prep-connector"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
+   "metadata": {
+    "jp-MarkdownHeadingCollapsed": true
+   },
+   "source": [
+    "##### **** Configure the transform parameters. We will only show the use of data_files_to_use and doc_chunk_chunking_type. For a complete list of parameters, please refer to the README.md for this transform\n",
+    "##### \n",
+    "| parameter:type | value | Description |\n",
+    "| --- | --- | --- |\n",
+    "|data_files_to_use: list | .parquet | Process all parquet files in the input folder |\n",
+    "| doc_chunk_chunking_type: str | dl_json | |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ebf1f782-0e61-485c-8670-81066beb734c",
+   "metadata": {},
+   "source": [
+    "##### ***** Import required Classes and modules"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import ast\n",
+    "import os\n",
+    "import sys\n",
+    "\n",
+    "from data_processing.runtime.pure_python import PythonTransformLauncher\n",
+    "from data_processing.utils import ParamsUtils\n",
+    "from doc_chunk_transform_python import DocChunkPythonTransformConfiguration\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7234563c-2924-4150-8a31-4aec98c1bf33",
+   "metadata": {},
+   "source": [
+    "##### ***** Setup runtime parameters for this transform"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "e90a853e-412f-45d7-af3d-959e755aeebb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# create parameters\n",
+    "input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n",
+    "output_folder = os.path.join( \"python\", \"output\")\n",
+    "local_conf = {\n",
+    "    \"input_folder\": input_folder,\n",
+    "    \"output_folder\": output_folder,\n",
+    "}\n",
+    "params = {\n",
+    "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
+    "    \"data_files_to_use\": ast.literal_eval(\"['.parquet']\"),\n",
+    "    \"runtime_pipeline_id\": \"pipeline_id\",\n",
+    "    \"runtime_job_id\": \"job_id\",\n",
+    "    \"doc_chunk_chunking_type\": \"dl_json\",\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a",
+   "metadata": {},
+   "source": [
+    "##### ***** Use python runtime to invoke the transform"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "0775e400-7469-49a6-8998-bd4772931459",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "15:19:48 INFO - pipeline id pipeline_id\n",
+      "15:19:48 INFO - code location None\n",
+      "15:19:48 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n",
+      "15:19:48 INFO - data factory data_ max_files -1, n_sample -1\n",
+      "15:19:48 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
+      "15:19:48 INFO - orchestrator doc_chunk started at 2024-11-20 15:19:48\n",
+      "15:19:48 INFO - Number of files is 1, source profile {'max_file_size': 0.011513710021972656, 'min_file_size': 0.011513710021972656, 'total_file_size': 0.011513710021972656}\n",
+      "15:19:48 INFO - Completed 1 files (100.0%) in 0.001 min\n",
+      "15:19:48 INFO - Done processing 1 files, waiting for flush() completion.\n",
+      "15:19:48 INFO - done flushing in 0.0 sec\n",
+      "15:19:48 INFO - Completed execution in 0.001 min, execution result 0\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%capture\n",
+    "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
+    "launcher = PythonTransformLauncher(runtime_config=DocChunkPythonTransformConfiguration())\n",
+    "launcher.launch()\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
+   "metadata": {},
+   "source": [
+    "##### **** The specified folder will include the transformed parquet files."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "7276fe84-6512-4605-ab65-747351e13a7c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['python/output/metadata.json', 'python/output/test1.parquet']"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import glob\n",
+    "glob.glob(\"python/output/*\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/transforms/language/doc_chunk/python/README.md b/transforms/language/doc_chunk/python/README.md
@@ -1,5 +1,16 @@
 # Chunk documents Transform 
 
+Please see the set of
+[transform project conventions](../../../README.md#transform-project-conventions)
+for details on general project conventions, transform configuration,
+testing and IDE set up.
+
+## Contributors
+
+- Michele Dolfi ([email protected])
+
+## Description 
+
 This transform is chunking documents. It supports multiple _chunker modules_ (see the `chunking_type` parameter).
 
 When using documents converted to JSON, the transform leverages the [Docling Core](https://github.com/DS4SD/docling-core) `HierarchicalChunker`
@@ -9,20 +20,26 @@ which provides the required JSON structure.
 
 When using documents converted to Markdown, the transform leverages the [Llama Index](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#markdownnodeparser) `MarkdownNodeParser`, which is relying on its internal Markdown splitting logic.
 
-## Output format
+
+### Input 
+
+| input column name | data type | description |
+|-|-|-|
+| the one specified in _content_column_name_ configuration | string | the content used in this transform |
+
+
+### Output format
 
 The output parquet file will contain all the original columns, but the content will be replaced with the individual chunks.
 
 
-### Tracing the origin of the chunks
+#### Tracing the origin of the chunks
 
 The transform allows to trace the origin of the chunk with the `source_doc_id` which is set to the value of the `document_id` column (if present) in the input table.
 The actual name of columns can be customized with the parameters described below.
 
 
-## Running
-
-### Parameters
+## Configuration
 
 The transform can be tuned with the following parameters.
 
@@ -40,6 +57,12 @@ The transform can be tuned with the following parameters.
 | `output_pageno_column_name`  | `page_number` | Column name to store the page number of the chunk in the output table. |
 | `output_bbox_column_name`    | `bbox` | Column name to store the bbox of the chunk in the output table. |
 
+
+
+## Usage
+
+### Launched Command Line Options 
+
 When invoking the CLI, the parameters must be set as `--doc_chunk_<name>`, e.g. `--doc_chunk_column_name_key=myoutput`.
 
 
@@ -63,8 +86,32 @@ ls output
 ```
 To see results of the transform.
 
+### Code example
+
+TBD (link to the notebook will be provided)
+
+See the sample script [src/doc_chunk_local_python.py](src/doc_chunk_local_python.py).
+
+
 ### Transforming data using the transform image
 
 To use the transform image to transform your data, please refer to the 
 [running images quickstart](../../../../doc/quick-start/run-transform-image.md),
 substituting the name of this transform image and runtime as appropriate.
+
+## Testing
+
+Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md)
+
+Currently we have:
+- [Unit test](test/test_doc_chunk_python.py)
+
+
+## Further Resource
+
+- For the [Docling Core](https://github.com/DS4SD/docling-core) `HierarchicalChunker`
+  - <https://ds4sd.github.io/docling/>
+- For the Markdown chunker in LlamaIndex
+  - [Markdown chunking](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#markdownnodeparser)
+- For the Token Text Splitter in LlamaIndex
+  - [Token Text Splitter](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/)
diff --git a/transforms/language/doc_chunk/python/requirements.txt b/transforms/language/doc_chunk/python/requirements.txt
@@ -1,3 +1,4 @@
 data-prep-toolkit==0.2.2.dev2
 docling-core==2.3.0
+pydantic>=2.0.0,<2.10.0 
 llama-index-core>=0.11.22,<0.12.0
diff --git a/transforms/language/pdf2parquet/README.md b/transforms/language/pdf2parquet/README.md
@@ -1,10 +1,10 @@
-# PDF2PARQUET Transform 
+# Pdf2Parquet Transform 
 
 
-The PDF2PARQUET transforms iterate through PDF files or zip of PDF files and generates parquet files
-containing the converted document in Markdown format.
+The Pdf2Parquet transforms iterate through PDF, Docx, Pptx, Images files or zip of files and generates parquet files
+containing the converted document in Markdown or JSON format.
 
-The PDF conversion is using the [Docling package](https://github.com/DS4SD/docling).
+The conversion is using the [Docling package](https://github.com/DS4SD/docling).
 
 The following runtimes are available: