From bd724237c3f9cd89ddb64704d423310d203bcae3 Mon Sep 17 00:00:00 2001 From: Sujee Maniyam Date: Tue, 15 Oct 2024 23:19:35 -0700 Subject: [PATCH 01/10] DPK intro example v1 Signed-off-by: Sujee Maniyam --- examples/notebooks/intro/README.md | 13 + .../notebooks/intro/dpk_intro_1_python.ipynb | 4204 +++++++++++++++++ .../notebooks/intro/dpk_intro_1_ray.ipynb | 3909 +++++++++++++++ .../data-prep-kit-3-workflow.excalidraw | 2832 +++++++++++ .../intro/images/data-prep-kit-3-workflow.png | Bin 0 -> 101303 bytes .../intro/input/solar-system/earth.md | 17 + .../intro/input/solar-system/earth.pdf | Bin 0 -> 58535 bytes .../intro/input/solar-system/mars.md | 17 + .../intro/input/solar-system/mars.pdf | Bin 0 -> 57872 bytes examples/notebooks/intro/my_utils.py | 55 + 10 files changed, 11047 insertions(+) create mode 100644 examples/notebooks/intro/README.md create mode 100644 examples/notebooks/intro/dpk_intro_1_python.ipynb create mode 100644 examples/notebooks/intro/dpk_intro_1_ray.ipynb create mode 100644 examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw create mode 100644 examples/notebooks/intro/images/data-prep-kit-3-workflow.png create mode 100644 examples/notebooks/intro/input/solar-system/earth.md create mode 100644 examples/notebooks/intro/input/solar-system/earth.pdf create mode 100644 examples/notebooks/intro/input/solar-system/mars.md create mode 100644 examples/notebooks/intro/input/solar-system/mars.pdf create mode 100644 examples/notebooks/intro/my_utils.py diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md new file mode 100644 index 000000000..53d21433c --- /dev/null +++ b/examples/notebooks/intro/README.md @@ -0,0 +1,13 @@ +# Data Prep Kit Introduction + +This is an example featuring some of the features of data prep kit. 
+ +## Running the code + +## Intro + +This notebook will demonstrate processing PDFs + +`PDFs ---> text ---> chunks ---> exact dedupe ---> fuzzy dedupe ---> embeddings` + +[python version](dpk_intro_1_python.ipynb)   |   [ray version](dpk_intro_1_ray.ipynb) diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb new file mode 100644 index 000000000..6f4cf757e --- /dev/null +++ b/examples/notebooks/intro/dpk_intro_1_python.ipynb @@ -0,0 +1,4204 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", + "metadata": { + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" + }, + "source": [ + "# Data Prep Kit Demo 1 - Python version\n", + "\n", + "This notebook will introduce DPK and showcase some of its capabilities.\n", + "\n", + "Here is the workflow\n", + "\n", + "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "b15976e3", + "metadata": { + "id": "b15976e3" + }, + "source": [ + "## How to run this notebook\n", + "\n", + "Two options:\n", + "\n", + "- **Option 1 - Google Colab:** The easiest option; no setup required. Click this link to open the notebook in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", + "- **Option 2 - Local Python dev environment:** Set up using this [guide](../../../README.md#-getting-started)\n", + "\n", + "The notebook will work in both environments." + ] + }, + { + "cell_type": "markdown", + "id": "eb8b0d5c", + "metadata": { + "id": "eb8b0d5c" + }, + "source": [ + "## Step-1: Inspect the Data\n", + "\n", + "We will use simple PDFs about the solar system. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/main/examples/notebooks/intro/input/solar-system)\n", + "\n", + "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/mars.pdf)\n" + ] + }, + { + "cell_type": "markdown", + "id": "39a0ab6e", + "metadata": { + "id": "39a0ab6e" + }, + "source": [ + "## Step-2: Figure out Runtime Environment\n", + "\n", + "### 2.1 - Determine runtime\n", + "\n", + "Determine whether we are running on Google Colab or in a local Python environment" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "1fe354b7", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1fe354b7", + "outputId": "0a38a7b5-238e-433a-c378-78444908aa8a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "markdown", + "id": "8e7c104b", + "metadata": { + "id": "8e7c104b" + }, + "source": [ + "### 2.2 - Download Data if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "3309799e", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3309799e", + "outputId": "9b44b764-d284-4da1-ad55-f08d5c9c0f89" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + " !mkdir -p 'input'\n", + " !wget -O 'input/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/earth.pdf'\n", + " !wget -O 'input/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/mars.pdf'\n", + 
" !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/my_utils.py'" + ] + }, + { + "cell_type": "markdown", + "id": "a5dc2b68", + "metadata": { + "id": "a5dc2b68" + }, + "source": [ + "### 2.3 - Install dependencies if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1fcec577", + "metadata": { + "id": "1fcec577" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + " ! pip install --default-timeout=100 \\\n", + " data-prep-toolkit[ray]==0.2.2.dev1 \\\n", + " data-prep-toolkit-transforms[ray,all]==0.2.2.dev1 \\\n", + " deepsearch-toolkit\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "243322b8", + "metadata": { + "id": "243322b8" + }, + "source": [ + "### 2.4 - Restart Runtime\n", + "\n", + "After installing dependencies, be sure to restart the runtime so the libraries are loaded\n", + "\n", + "You do this by going to **`Runtime --> Restart Session`**\n", + "\n", + "Then you can continue to the next step (no need to re-run the notebook)" + ] + }, + { + "cell_type": "markdown", + "id": "e8b10be1", + "metadata": { + "id": "e8b10be1" + }, + "source": [ + "## Step-2: Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "356c66f7", + "metadata": { + "id": "356c66f7" + }, + "source": [ + "### 2.1 - Basic Config" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "e4YMZrBuFycl", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e4YMZrBuFycl", + "outputId": "42a9edae-205f-4dce-cd4e-a159bd8f620b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": 
"33345487", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "33345487", + "outputId": "79b40d76-b4dd-48ea-9638-461c78a637a1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "MY_CONFIG.RAY_RUNTIME_WORKERS: 2\n", + "MY_CONFIG.RAY_NUM_CPUS: 0.8\n", + "MY_CONFIG.RAY_MEMORY_GB: 2\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "## Configuration\n", + "class MyConfig:\n", + " pass\n", + "\n", + "MY_CONFIG = MyConfig ()\n", + "\n", + "if RUNNING_IN_COLAB:\n", + " MY_CONFIG.INPUT_DATA_DIR = 'input'\n", + "else:\n", + " MY_CONFIG.INPUT_DATA_DIR = os.path.join (os.path.abspath (''), '..', 'data', 'solar-system')\n", + " \n", + "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", + "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", + "\n", + "## Embedding model\n", + "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'\n", + "\n", + "## RAY CONFIGURATION\n", + "### For local runs, we can use more parallelism\n", + "### For google colab, be conservative\n", + "\n", + "if RUNNING_IN_COLAB:\n", + " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + " MY_CONFIG.RAY_NUM_CPUS = 0.3\n", + " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", + "else: # local run\n", + " num_cpus_available = os.cpu_count()\n", + " # print (num_cpus_available)\n", + " MY_CONFIG.RAY_NUM_CPUS = 0.8\n", + " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", + " # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", + " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + "\n", + "print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)\n", + "print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)\n", + "print ('MY_CONFIG.RAY_MEMORY_GB:', MY_CONFIG.RAY_MEMORY_GB)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "b15e6827", + "metadata": { + "id": "b15e6827" + }, + "outputs": [], + "source": [ + "## Add parent dir to path\n", + "import os,sys\n", + "\n", + "this_dir = 
os.path.abspath('')\n", + "parent_dir = os.path.dirname(this_dir)\n", + "sys.path.append (os.path.abspath (parent_dir))" + ] + }, + { + "cell_type": "markdown", + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", + "metadata": { + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" + }, + "source": [ + "### 2.2 - Set up input/output directories" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "outputId": "5c305d54-1c91-455d-d0e2-b514b61a068b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared output directory\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "\n", + "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", + " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", + "\n", + "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", + "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", + "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", + "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", + "output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_fuzzy_dedupe_out')\n", + "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_embeddings_out')\n", + "\n", + "## clear output folder\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", + "os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", + "\n", + "print (\"✅ Cleared output directory\")" + ] + }, + { + "cell_type": "markdown", + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", + "metadata": { + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" + }, + "source": [ + "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", + "\n", + "This step is 
reading the input folder containing all PDF files and ingesting them into a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", + "The documents are converted into a JSON format, which makes them easy to chunk in later steps.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", + "metadata": { + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a" + }, + "source": [ + "### 3.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "outputId": "90eb1f89-35d1-4b6f-ea34-7667680dd256" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-1: Processing input='/home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system' --> output='output/01_parquet_out'\n" + ] + } + ], + "source": [ + "STAGE = 1\n", + "\n", + "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", + "output_folder = output_parquet_dir\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", + "metadata": { + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" + }, + "source": [ + "### 3.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 625, + "referenced_widgets": [ + "8226b2522ce446f6bd3a36c4e227370c", + "7616f1b493e1461c9fd1319fae3bc10b", + "4f63bfad92b64e7bae18e720376d402d", + "6957a659451b46dab702c1c62fa9cdd2", + "2eea7bc810e54eaeb325136352b71e66", + "ebc626c0750c470db6789b26acf15f60", + "3077f04af3a9447ab98717bd3131cd8f", + "709685da1c6c4164bed658357a2191bf", + 
"0a1ed94698ca4e4291c553929e0ca66c", + "5dbc6889a9c243c5a922f8cc5f1a704c", + "d6e520e4da004c818031ccfcc3588e5d" + ] + }, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "outputId": "e2c85b44-f605-4817-c120-2cdce79e3c84" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:02 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "18:40:02 INFO - pipeline id pipeline_id\n", + "18:40:02 INFO - code location None\n", + "18:40:02 INFO - data factory data_ is using local data access: input_folder - /home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system output_folder - output/01_parquet_out\n", + "18:40:02 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "18:40:02 INFO - orchestrator pdf2parquet started at 2024-09-18 18:40:02\n", + "18:40:02 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", + "18:40:02 INFO - Initializing models\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "6454e0eb538145aebeed98e2ec662b22", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Fetching 7 files: 0%| | 0/7 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + "
 | filename | contents | num_pages | num_tables | num_doc_elements | document_id | ext | hash | size | date_acquired | pdf_convert_time | source_filename
0 | mars.pdf | {\"_name\":\"\",\"type\":\"pdf-document\",\"description... | 1 | 0 | 11 | 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 | pdf | 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... | 2800 | 2024-09-18T18:40:07.682106 | 0.838944 | mars.pdf
1 | earth.pdf | {\"_name\":\"\",\"type\":\"pdf-document\",\"description... | 1 | 0 | 11 | e1053a34-3cc1-45c1-abe7-204a240152c0 | pdf | 18713f970989055625bef22209b6f4b6830b9ca22046bf... | 2686 | 2024-09-18T18:40:06.831334 | 0.857239 | earth.pdf
\n", + "" + ], + "text/plain": [ + " filename contents num_pages \\\n", + "0 mars.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "\n", + " num_tables num_doc_elements document_id ext \\\n", + "0 0 11 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 0 11 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:06.831334 0.857239 earth.pdf " + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(5)\n", + "\n", + "## To display certain columns\n", + "#output_df[['column1', 'column2', 'column3']].head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "e5058a21", + "metadata": { + "id": "e5058a21" + }, + "source": [ + "\n", + "### 3.4 - Understand the output\n", + "\n", + "Here are some interesting attributes to note:\n", + "\n", + "- **filename** : original filename\n", + "- **contents** : document content, serialized as JSON\n", + "- **document_id**: unique id (UUID) assigned to this document\n", + "- **hash** : hash of document\n", + "- **pdf_convert_time** : time to convert this PDF, in seconds\n", + "\n", + "Let's inspect the **contents** column. See how the text is being divided up!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "f870e624", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f870e624", + "outputId": "f70bfa9f-62f8-417d-d91a-30c1f024ccbd" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',\n", + " 'filename': 'mars.pdf',\n", + " 'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.35137939,\n", + " 654.45184326,\n", + " 169.88169861,\n", + " 667.98492432],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.09541321,\n", + " 630.68127441,\n", + " 210.66503906,\n", + " 642.34405518],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.84518433,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.02520752],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. 
At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.18510437,\n", + " 570.83258057,\n", + " 374.99838257,\n", + " 581.07043457],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about the Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.22866821,\n", + " 542.98168945,\n", + " 163.86282349,\n", + " 554.45288086],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87440491,\n", + " 500.84011841,\n", + " 477.48345947,\n", + " 534.55810547],\n", + " 'page': 1,\n", + " 'span': [0, 196]}],\n", + " 'text': 'Mars, the fourth planet from the Sun, is a cold, '\n", + " 'desert world with a thin atmosphere composed '\n", + " 'primarily of carbon dioxide. Its reddish hue comes '\n", + " 'from iron oxide, or rust, prevalent on its surface.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.2026062,\n", + " 482.90710449,\n", + " 237.04431152,\n", + " 493.07443237],\n", + " 'page': 1,\n", + " 'span': [0, 23]}],\n", + " 'text': 'Basic facts about Mars:',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 453.019104,\n", + " 477.48171997,\n", + " 474.9703064],\n", + " 'page': 1,\n", + " 'span': [0, 78]}],\n", + " 'text': '· Distance from the Sun: Average of 228 million '\n", + " 'kilometers (142 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.79351807,\n", + " 431.73287964,\n", + " 451.2142334],\n", + " 'page': 1,\n", + " 'span': [0, 64]}],\n", + " 'text': '· Rotation Period: 24.6 hours (one Martian day - '\n", + " 'called a \"sol\")',\n", + " 'type': 'paragraph'},\n", + " 
{'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 429.10913086,\n", + " 365.9559021,\n", + " 438.83737183],\n", + " 'page': 1,\n", + " 'span': [0, 44]}],\n", + " 'text': '· Moons: Two small moons, Phobos and Deimos.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.51646423],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "import pprint\n", + "import json\n", + "\n", + "pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))\n", + "# json.loads(output_df.iloc[0, ]['contents'])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "e1a10c2d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e1a10c2d", + "outputId": "300e7688-692a-4039-c4a4-a86887d9138b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',\n", + " 'filename': 'earth.pdf',\n", + " 'page-hashes': [{'hash': 'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.30961609,\n", + " 654.45184326,\n", + " 174.04208374,\n", + " 667.93347168],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.12528992,\n", + " 
630.69073486,\n", + " 210.66503906,\n", + " 642.27935791],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87112427,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.04595947],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.20942688,\n", + " 570.81555176,\n", + " 375.57919312,\n", + " 581.08459473],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about our Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.15542603,\n", + " 542.98168945,\n", + " 167.32983398,\n", + " 554.36669922],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.91053772,\n", + " 512.46295166,\n", + " 477.84887695,\n", + " 534.48431396],\n", + " 'page': 1,\n", + " 'span': [0, 107]}],\n", + " 'text': \"Earth is the third planet from the Sun. It's our home \"\n", + " 'planet. 
Earth is the only place we know of with life.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.30151367,\n", + " 494.86206055,\n", + " 240.17156982,\n", + " 505.07229614],\n", + " 'page': 1,\n", + " 'span': [0, 24]}],\n", + " 'text': 'Basic facts about Earth:',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 464.97409058,\n", + " 477.47979736,\n", + " 487.02810669],\n", + " 'page': 1,\n", + " 'span': [0, 79]}],\n", + " 'text': '· Distance from the Sun: Average of 149.6 million '\n", + " 'kilometers (93 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 452.86901855,\n", + " 317.90722656,\n", + " 463.24041748],\n", + " 'page': 1,\n", + " 'span': [0, 37]}],\n", + " 'text': '· Rotation Period: 24 hours (one day)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.71496582,\n", + " 396.66357422,\n", + " 451.19915771],\n", + " 'page': 1,\n", + " 'span': [0, 52]}],\n", + " 'text': '· Moons: One moon, called Luna or simply \"the Moon\".',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.53633118],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))" + ] + }, + { + "cell_type": "markdown", + "id": "72274586", + "metadata": { + "id": "72274586" + }, + "source": [ + "## Step-4: Doc chunks\n", + "\n", + "In the previous step, we extracted text from our PDFs. 
But the content of each file is stored as a single row in our parquet output.\n", + "\n", + "In this step, we are going to split the documents into chunks, according to their layout segmentation.\n", + "\n", + "This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`\n", + "to chunk according to the document layout segmentation, i.e. respecting the original document components such as paragraphs, tables, enumerations, etc.\n", + "It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: \"application/json\"`,\n", + "which provides the required JSON structure." + ] + }, + { + "cell_type": "markdown", + "id": "96198fa6", + "metadata": { + "id": "96198fa6" + }, + "source": [ + "### 4.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "305f00a3", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "305f00a3", + "outputId": "a787385b-214a-41b2-975d-0d3c5529c2c4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" + ] + } + ], + "source": [ + "STAGE = 2\n", + "\n", + "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_chunk_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "369f2cd1", + "metadata": { + "id": "369f2cd1" + }, + "source": [ + "### 4.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "5b7b18d5", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": 
"5b7b18d5", + "outputId": "cb338503-3dca-45bd-a60a-bd214843a97b" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:09 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'output_chunk_column_name': 'contents', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", + "18:40:09 INFO - pipeline id pipeline_id\n", + "18:40:09 INFO - code location None\n", + "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:40:09 INFO - orchestrator doc_chunk started at 2024-09-18 18:40:09\n", + "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "18:40:09 INFO - done flushing in 0.0 sec\n", + "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:2 completed successfully\n", + "CPU times: user 861 ms, sys: 140 ms, total: 1 s\n", + "Wall time: 1.21 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from doc_chunk_transform_python import DocChunkPythonTransformConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " 
\"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # doc_chunk arguments\n", + " # ...\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(DocChunkPythonTransformConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "213afdf6", + "metadata": { + "id": "213afdf6" + }, + "source": [ + "### 4.3 - Inspect Generated output\n", + "\n", + "We should see that the documents are split into many chunks" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "d8138d43", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 893 + }, + "id": "d8138d43", + "outputId": "0d08e0a6-e743-44d9-b8f1-eec98b222a92" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Files processed : 2\n", + "Chunks created : 8\n", + "Input data dimensions (rows x columns)= (2, 12)\n", + "Output data dimensions (rows x columns)= (8, 15)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filenamecontentsdoc_jsonpathpage_numberbbox
0mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...
1mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...
2mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...
3mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....
4earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...
5earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...
6earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...
7earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "7 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "3 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "7 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "3 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "7 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (f\"Files processed : {input_df.shape[0]:,}\")\n", + "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "9e9ca75c", + "metadata": { + "id": "9e9ca75c" + }, + "source": [ + "### 4.4 - Understanding the Output\n", + "\n", + "Here we see the 2 PDF files are split into 8 chunks (4 per document). The documents are split along 'natural boundaries' - paragraphs and bullet points.\n", + "\n", + "See how **document_id** is carried throughout. 
This helps us identify original documents.\n", + "\n", + "Also note **contents** is now plain text (not JSON as before)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "3090c950", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "3090c950", + "outputId": "cf9bd956-7b31-42bc-ef77-9ebded8ba08e" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontents
0mars.pdfSolar System\\nOur solar system is a vast and f...
1mars.pdfSolar System\\nFor more details about the Solar...
2mars.pdfMars\\nMars, the fourth planet from the Sun, is...
3mars.pdfBasic facts about Mars:\\n· Distance from the S...
4earth.pdfSolar System\\nOur solar system is a vast and f...
5earth.pdfSolar System\\nFor more details about our Solar...
6earth.pdfEarth\\nEarth is the third planet from the Sun....
7earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", + "
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", + "1 mars.pdf Solar System\\nFor more details about the Solar...\n", + "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "4 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "5 earth.pdf Solar System\\nFor more details about our Solar...\n", + "6 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "7 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "d5f151ae", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d5f151ae", + "outputId": "2b48675c-328d-4d24-d689-ad77231ef4b7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. 
Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 3------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "7ad1c60d", + "metadata": {}, + "source": [ + "## Step-5: DOC ID generation of Chunks\n", + "\n", + "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", + "\n", + " - Adding document hash: this enables the addition of a document hash-based id to the data. 
The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where you want to store it.\n", + " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_column** to the name of the column where you want to store it.\n", + "\n", + "**This is a pre-requisite for fuzzy dedup** in the pipeline." + ] + }, + { + "cell_type": "markdown", + "id": "1afaa0fd", + "metadata": {}, + "source": [ + "### 5.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "6ffd6f54", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" + ] + } + ], + "source": [ + "\n", + "# Input for this stage is the output of the chunking component\n", + "# The output of this component makes it possible for the fdedup component to run on the data.\n", + "\n", + "STAGE = 3\n", + "\n", + "input_folder = output_chunk_dir\n", + "output_folder = output_docid_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "f78a51b7", + "metadata": {}, + "source": [ + "### 5.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "5fc77557", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:09 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "18:40:09 INFO - pipeline id pipeline_id\n", + "18:40:09 INFO - code location None\n", + "18:40:09 INFO - data factory 
data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:40:09 INFO - orchestrator doc_id started at 2024-09-18 18:40:09\n", + "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.008135795593261719, 'min_file_size': 0.008058547973632812, 'total_file_size': 0.01619434356689453}\n", + "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "18:40:09 INFO - done flushing in 0.0 sec\n", + "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:3 completed successfully\n", + "CPU times: user 19.2 ms, sys: 603 μs, total: 19.8 ms\n", + "Wall time: 16.2 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " # doc id configuration\n", + " \"doc_id_doc_column\": \"contents\",\n", + " \"doc_id_hash_column\": \"chunk_hash\",\n", + " \"doc_id_int_column\": \"chunk_id\",\n", + "}\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "\n", + "launcher = PythonTransformLauncher(DocIDPythonTransformRuntimeConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "a9a8c1fa", + "metadata": {}, + "source": [ + "### 5.3 - Inspect Generated output\n", + "\n", + "You will notice we have two extra columns\n", + "\n", + "- **chunk_hash**\n", + "- **chunk_id**\n", + "\n", + "But still the same number of rows as before" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "da9adede", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 15)\n", + "Output data dimensions (rows x columns)= (8, 17)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filenamecontentsdoc_jsonpathpage_numberbboxchunk_hashchunk_id
0mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...4
1mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5
2mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6
3mars.pdf101125064eb4-470e-4d7e-b2f5-84d59cbbe6f1pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:40:07.6821060.838944mars.pdfBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7
4earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0
5earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1
6earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2
7earth.pdf1011e1053a34-3cc1-45c1-abe7-204a240152c0pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:40:06.8313340.857239earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "7 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "3 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "7 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "3 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "7 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " chunk_hash chunk_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
3 " + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53", + "metadata": { + "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53" + }, + "source": [ + "## Step-6: Exact Dedup\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", + "metadata": { + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe" + }, + "source": [ + "### 6.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "4c7a1b94", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4c7a1b94", + "outputId": "2a135853-c54f-4aa4-ffc4-83c2bc7a68ce" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" + ] + } + ], + "source": [ + "STAGE = 4\n", + "\n", + "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_exact_dedupe_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", + "metadata": { + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e" + }, + "source": [ + "### 6.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "metadata": { + "colab": { + "base_uri": 
"https://localhost:8080/" + }, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "outputId": "b9b3de92-4304-4540-dfba-a4549fa157eb" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:09 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", + "18:40:09 INFO - pipeline id pipeline_id\n", + "18:40:09 INFO - code location None\n", + "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:40:09 INFO - orchestrator ededup started at 2024-09-18 18:40:09\n", + "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", + "18:40:09 INFO - Starting from the beginning\n", + "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "18:40:09 INFO - done flushing in 0.0 sec\n", + "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:4 completed successfully\n", + "CPU times: user 15.4 ms, sys: 478 μs, total: 15.9 ms\n", + "Wall time: 12.9 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": 
input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # ededup parameters\n", + " \"ededup_doc_column\": \"contents\",\n", + " \"ededup_doc_id_column\": \"chunk_hash\",\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "eaf1c3c3", + "metadata": { + "id": "eaf1c3c3" + }, + "source": [ + "### 6.3 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d824ebf6", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 358 + }, + "id": "d824ebf6", + "outputId": "14aa660f-6f1a-4f93-9b61-5f8f8adcf3fe" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 17)\n", + "Output data dimensions (rows x columns)= (7, 18)\n", + "Input chunks before exact dedupe : 8\n", + "Output chunks after exact dedupe : 7\n", + "Duplicate chunks removed : 1\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 earth.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nOur solar system is a vast and f... 
$.main-text[2] \n", + "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " chunk_hash chunk_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", + "\n", + " removed \n", + "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... 
\n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [] \n", + "5 [] \n", + "6 [] " + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "82cc9bb0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "82cc9bb0", + "outputId": "2aff0a5f-8cc7-408c-e1cf-62c0b14b18fb" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nFor more details about the Solar...\n", + "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "2 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "3 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "4 earth.pdf Solar System\\nFor more details about our Solar...\n", + "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "6 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "cc61dffa", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cc61dffa", + "outputId": "337b015f-3795-4c45-98a3-03ae817d4dca" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 1------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 2------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "383f40ba", + "metadata": { + "id": "383f40ba" + }, + "source": [ + "### 6.4 - Understanding the output\n", + "\n", + "We started with 8 chunks and now have 7: one duplicate chunk was removed.\n", + "\n", + "The following paragraph appears in both `earth.pdf` and `mars.pdf`. Exact dedupe kept the copy from `earth.pdf` and removed the one from `mars.pdf`. Pretty neat, eh!\n", + "\n", + "```text\n", + "## Solar System\n", + "\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "85309751-8556-41c6-ac32-84acc941bc8d", + "metadata": { + "id": "85309751-8556-41c6-ac32-84acc941bc8d" + }, + "source": [ + "## Step-7: Fuzzy Dedup\n", + "\n", + "After exact deduplication, fuzzy deduplication is applied to remove chunks that are **very similar** but not identical.\n", + "\n", + "Fuzzy dedupe is only available in the Ray version, so this step uses the Ray launcher." + ] + }, + { + "cell_type": "markdown", + "id": "fcf574a3-b287-419c-9c86-07b828b41ca6", + "metadata": { + "id": "fcf574a3-b287-419c-9c86-07b828b41ca6" + }, + "source": [ + "### 7.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "outputId": "4450ed63-3b09-42e4-8085-2951e700cf8f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-5: Processing input='output/04_exact_dedupe_out' --> output='output/05_fuzzy_dedupe_out'\n" + ] + } + ], + "source": [ + "## Input to this component is the output of the exact dedupe stage.\n", + "\n", + "STAGE = 5\n", + "\n", + "input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_fuzzy_dedupe_dir\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3", + "metadata": { + "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3" + }, + "source": [ + "### 7.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": 
"3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "outputId": "2baa790d-6944-4d20-f0c1-fc2979eb1686" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:09 INFO - Running locally\n", + "18:40:09 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.8}}\n", + "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_fuzzy_dedupe_out\n", + "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:40:09 INFO - pipeline id pipeline_id\n", + "18:40:09 INFO - code location None\n", + "18:40:09 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "18:40:09 INFO - actor creation delay 0\n", + "18:40:09 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:40:11,503\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - orchestrator started at 2024-09-18 18:40:12\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Number of files is 2, source profile {'max_file_size': 0.009611129760742188, 'min_file_size': 0.009521484375, 'total_file_size': 0.019132614135742188}\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.208082581870258, 'object_store': 4.104041289538145}\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - starting run from the beginning\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - continuing from the very beginning\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Fuzzy: num buckets 8, bucket length 8\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 bucket actors\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 minhash actors\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Table preprocessing uses 1 readers\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 table processor actors\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:13 INFO - Completed 1 files in 0.014 min\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:13 INFO - Completed 1 files (50.0%) in 0.014 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:15 INFO - Completed processing 2 files in 0.047 min\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:15 INFO - creating minhash snapshots\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:16 INFO - minhash snapshots created\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:16 INFO - creating bucket snapshots\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - bucket snapshots created\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created 1 document actors\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created 1 bucket processor actors\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created bucket processor invoker\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - added invoker to bucket collectors\n", + "\u001b[36m(BucketsHash pid=1191796)\u001b[0m 18:40:17 INFO - processing buckets 0 long, 53 short\n", + "\u001b[36m(BucketsHash pid=1191796)\u001b[0m 18:40:17 INFO - Done submitting long buckets\n", + "\u001b[36m(BucketsHashProcessorInvoker pid=1192188)\u001b[0m 18:40:18 INFO - Waiting bucket processing completion. Submitted requests 1\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:18 INFO - Done processing buckets in 0.011 min\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:18 INFO - creating document snapshots\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:19 INFO - document snapshots created\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:19 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:27 INFO - Completed processing 2 files in 0.131 min\n", + "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:27 INFO - done flushing in 0.004 sec\n", + "18:40:37 INFO - Completed execution in 0.462 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:5 completed successfully\n", + "CPU times: user 457 ms, sys: 296 ms, total: 753 ms\n", + "Wall time: 29.2 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import os\n", + "import sys\n", + "\n", + "from data_processing.utils import ParamsUtils\n", + "from fdedup_transform_ray import FdedupRayTransformConfiguration\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "\n", + "# create parameters\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # Orchestration parameters\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " # columns used\n", + " \"fdedup_doc_column\": \"contents\",\n", + " \"fdedup_id_column\": \"chunk_id\",\n", + " \"fdedup_cluster_column\": \"chunk_hash\",\n", + " # infrastructure\n", + " \"fdedup_bucket_cpu\": 0.3,\n", + " \"fdedup_doc_cpu\": 0.3,\n", + " \"fdedup_mhash_cpu\": 0.3,\n", + " \"fdedup_num_doc_actors\": 1,\n", + " \"fdedup_num_bucket_actors\": 1,\n", + " \"fdedup_num_minhash_actors\": 1,\n", + " \"fdedup_num_preprocessors\": 1,\n", + " # fuzzy parameters\n", + " \"fdedup_num_permutations\": 64,\n", + " \"fdedup_threshold\": 0.7, # (default 0.8)\n", + " \"fdedup_shingles_size\": 5,\n", + " \"fdedup_delimiters\": \" \"\n", + "}\n", + "\n", + "# Pass commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "\n", + "launcher = RayTransformLauncher(FdedupRayTransformConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "a6f8cd11", + "metadata": { + "id": "a6f8cd11" + }, + "source": [ + "### 7.3 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "e899ad60", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 222 + }, + "id": "e899ad60", + "outputId": "17aaaea8-a106-4c9a-ceb3-6760d92f8b59" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (7, 18)\n", + "Output data dimensions (rows x columns)= (6, 18)\n", + "Duplicate chunks removed by fuzzy-dedupe: 1\n" + ] + }, 
+ { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 earth.pdf 1 0 11 \n", + "3 earth.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "2 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "2 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "1 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "2 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "5 Earth\\nBasic facts about Earth:\\n· Distance fr... 
$.main-text[6] \n", + "\n", + " page_number bbox chunk_id \\\n", + "0 1 [132.87440491, 500.84011841, 477.48345947, 534... 6 \n", + "1 1 [133.2026062, 482.90710449, 237.04431152, 493.... 7 \n", + "2 1 [132.87112427, 588.96014404, 479.40917969, 623... 0 \n", + "3 1 [133.20942688, 570.81555176, 375.57919312, 581... 1 \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 2 \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 3 \n", + "\n", + " removed chunk_hash \n", + "0 [] -1 \n", + "1 [] -1 \n", + "2 [] -1 \n", + "3 [] 5 \n", + "4 [] -1 \n", + "5 [] -1 " + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (\"Duplicate chunks removed by fuzzy-dedupe: \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "ab7ea52b", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 81 + }, + "id": "ab7ea52b", + "outputId": "8e57385f-c925-4ac7-9e0d-ebc64e92530a" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "1 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "2 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "3 earth.pdf Solar System\\nFor more details about our Solar...\n", + "4 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "5 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "6bdd3515", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6bdd3515", + "outputId": "00705442-b6ae-4238-b0f5-c94de690ecb4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 1------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "2b34d9c6", + "metadata": { + "id": "2b34d9c6" + }, + "source": [ + "### 7.4 - Understanding the output\n", + "\n", + "We started with 7 chunks and ended up with 6. Fuzzy dedupe found the following two **very similar** chunks and removed the copy from `mars.pdf`.\n", + "\n", + "The two chunks differ only in the words 'the' and 'our':\n", + "\n", + "**earth.pdf**\n", + "\n", + "`For more details about *our* Solar system see Chapter 1.`\n", + "\n", + "**mars.pdf**\n", + "\n", + "`For more details about *the* Solar system see Chapter 1.`\n", + "\n", + "Pretty neat, eh? 👏\n", + "\n", + "### Configuring Fuzzy de-dupe\n", + "\n", + "You can tune fuzzy dedupe by adjusting the following parameters:\n", + "\n", + "```python\n", + "# fuzzy parameters\n", + " \"fdedup_num_permutations\": 64,\n", + " \"fdedup_threshold\": 0.7, # (default 0.8)\n", + " \"fdedup_shingles_size\": 5,\n", + " \"fdedup_delimiters\": \" \"\n", + "```\n", + "\n", + "In our case, we set the `fdedup_threshold` parameter to 0.7, lower than the default of 0.8. 
\n" + ] + }, + { + "cell_type": "markdown", + "id": "5370950a-2a3a-4143-8218-f9b4808099ba", + "metadata": { + "id": "5370950a-2a3a-4143-8218-f9b4808099ba" + }, + "source": [ + "## Step-8: Text encoding\n", + "\n", + "Encode the text chunks as embedding vectors, ready for vector storage." + ] + }, + { + "cell_type": "markdown", + "id": "85aba685", + "metadata": { + "id": "85aba685" + }, + "source": [ + "### 8.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "outputId": "e1795167-9fac-4b7c-9417-f655c30848a1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-6: Processing input='output/05_fuzzy_dedupe_out' --> output='output/06_embeddings_out'\n" + ] + } + ], + "source": [ + "STAGE = 6\n", + "\n", + "input_folder = output_fuzzy_dedupe_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_embeddings_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "c97545f4", + "metadata": { + "id": "c97545f4" + }, + "source": [ + "### 8.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "outputId": "f4c2cba4-aed0-4eee-873b-d1a8abf60cbd" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:40:39 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "18:40:39 INFO - 
pipeline id pipeline_id\n", + "18:40:39 INFO - code location None\n", + "18:40:39 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", + "18:40:39 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:40:39 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:40:39 INFO - orchestrator text_encoder started at 2024-09-18 18:40:39\n", + "18:40:39 INFO - Number of files is 2, source profile {'max_file_size': 0.009204864501953125, 'min_file_size': 0.009014129638671875, 'total_file_size': 0.018218994140625}\n", + "18:40:41 INFO - Completed 1 files (50.0%) in 0.003 min\n", + "18:40:41 INFO - Completed 2 files (100.0%) in 0.003 min\n", + "18:40:41 INFO - Done processing 2 files, waiting for flush() completion.\n", + "18:40:41 INFO - done flushing in 0.0 sec\n", + "18:40:41 INFO - Completed execution in 0.032 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:6 completed successfully\n", + "CPU times: user 816 ms, sys: 204 ms, total: 1.02 s\n", + "Wall time: 2.53 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # text_encoder\n", + " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", + "}\n", + "\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "b734852c", + "metadata": { + "id": "b734852c" + }, + "source": [ + "### 8.3 - Inspect Generated output\n", + "\n", + "You will see a column called `embeddings` added at the end. This is the text content converted into embedding vectors. We used the model `sentence-transformers/all-MiniLM-L6-v2`." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "7b1c1d09", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 205 + }, + "id": "7b1c1d09", + "outputId": "86c49244-9f9f-4116-fb17-c27ff6c29bc7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (6, 18)\n", + "Output data dimensions (rows x columns)= (6, 19)\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 earth.pdf 1 0 11 \n", + "3 earth.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", + "2 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", + "2 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "1 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "2 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "5 Earth\\nBasic facts about Earth:\\n· Distance fr... 
$.main-text[6] \n", + "\n", + " page_number bbox chunk_id \\\n", + "0 1 [132.87440491, 500.84011841, 477.48345947, 534... 6 \n", + "1 1 [133.2026062, 482.90710449, 237.04431152, 493.... 7 \n", + "2 1 [132.87112427, 588.96014404, 479.40917969, 623... 0 \n", + "3 1 [133.20942688, 570.81555176, 375.57919312, 581... 1 \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 2 \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 3 \n", + "\n", + " removed chunk_hash embeddings \n", + "0 [] -1 [0.07728295, 0.024970993, -0.043180738, 0.0580... \n", + "1 [] -1 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", + "2 [] -1 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", + "3 [] 5 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", + "4 [] -1 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", + "5 [] -1 [0.091821924, 0.015197902, 0.07716932, 0.01711... " + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "f5e12630-be6b-4188-a925-77117155617b", + "metadata": { + "id": "f5e12630-be6b-4188-a925-77117155617b" + }, + "source": [ + "## Step-9: Copy output to final output dir" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "outputId": "aa667c65-8421-4d4d-f57e-47ccc4ea41ad" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Copied output from 'output/06_embeddings_out' --> 'output/output_final'\n" + ] + } + ], + "source": 
[ + "import shutil\n", + "\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", + "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", + "\n", + "print (f\"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "0a1ed94698ca4e4291c553929e0ca66c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "2eea7bc810e54eaeb325136352b71e66": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + 
"grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "3077f04af3a9447ab98717bd3131cd8f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "4f63bfad92b64e7bae18e720376d402d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_709685da1c6c4164bed658357a2191bf", + "max": 7, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_0a1ed94698ca4e4291c553929e0ca66c", + "value": 7 + } + }, + "5dbc6889a9c243c5a922f8cc5f1a704c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, 
+ "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6957a659451b46dab702c1c62fa9cdd2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5dbc6889a9c243c5a922f8cc5f1a704c", + "placeholder": "​", + "style": "IPY_MODEL_d6e520e4da004c818031ccfcc3588e5d", + "value": " 7/7 [00:00<00:00, 221.60it/s]" + } + }, + "709685da1c6c4164bed658357a2191bf": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + 
"_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7616f1b493e1461c9fd1319fae3bc10b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_ebc626c0750c470db6789b26acf15f60", + "placeholder": "​", + "style": "IPY_MODEL_3077f04af3a9447ab98717bd3131cd8f", + "value": "Fetching 7 files: 100%" + } + }, + "8226b2522ce446f6bd3a36c4e227370c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": 
[ + "IPY_MODEL_7616f1b493e1461c9fd1319fae3bc10b", + "IPY_MODEL_4f63bfad92b64e7bae18e720376d402d", + "IPY_MODEL_6957a659451b46dab702c1c62fa9cdd2" + ], + "layout": "IPY_MODEL_2eea7bc810e54eaeb325136352b71e66" + } + }, + "d6e520e4da004c818031ccfcc3588e5d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "ebc626c0750c470db6789b26acf15f60": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git 
a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb new file mode 100644 index 000000000..7ce746c67 --- /dev/null +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -0,0 +1,3909 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", + "metadata": { + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" + }, + "source": [ + "# Data Prep Kit Demo 1 - Ray Version\n", + "\n", + "This notebook introduces DPK and showcases some of its capabilities.\n", + "\n", + "Here is the workflow:\n", + "\n", + "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "b15976e3", + "metadata": { + "id": "b15976e3" + }, + "source": [ + "## How to run this notebook\n", + "\n", + "Two options:\n", + "\n", + "- **Option 1 - Google Colab:** The easiest option; no setup required. Click this badge to open the notebook in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_ray.ipynb)\n", + "- **Option 2 - Local Python dev environment:** Set up using this [guide](../../../README.md#-getting-started)\n", + "\n", + "The notebook works in both environments." + ] + }, + { + "cell_type": "markdown", + "id": "eb8b0d5c", + "metadata": { + "id": "eb8b0d5c" + }, + "source": [ + "## Step-1: Inspect the Data\n", + "\n", + "We will use simple PDFs about the solar system. 
The files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/solar-system)\n", + "\n", + "- [earth.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/mars.pdf)\n", + "\n", + "### (Optional) How to create PDFs?\n", + "\n", + "If you would like to experiment with different input files, follow these steps to regenerate the PDFs.\n", + "\n", + "**Option 1 (Easiest): Use a word processor or Google Docs**\n", + "\n", + "Write your content and export it as a PDF.\n", + "\n", + "\n", + "**Option 2: Markdown -> PDF**\n", + "\n", + "First, edit the Markdown files using any text editor.\n", + "\n", + "Then use [pandoc](https://pandoc.org/) to convert them to PDFs.\n", + "\n", + "```bash\n", + "pandoc earth.md -o earth.pdf\n", + "pandoc mars.md -o mars.pdf\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "39a0ab6e", + "metadata": { + "id": "39a0ab6e" + }, + "source": [ + "## Step-2: Figure out Runtime Environment\n", + "\n", + "### 2.1 - Determine runtime\n", + "\n", + "Determine whether we are running on Google Colab or in a local Python environment." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "1fe354b7", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1fe354b7", + "outputId": "6fe04a4c-8092-49bb-f4ee-ffdcd42b6c11" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "markdown", + "id": "8e7c104b", + "metadata": { + "id": "8e7c104b" + }, + "source": [ + "### 2.2 - Download Data if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 
2, + "id": "3309799e", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3309799e", + "outputId": "5af8cfbc-346d-41bd-c14e-c917d0f403f3" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + " !mkdir -p 'input'\n", + " !wget -O 'input/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/earth.pdf'\n", + " !wget -O 'input/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/mars.pdf'\n", + " !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/my_utils.py'" + ] + }, + { + "cell_type": "markdown", + "id": "a5dc2b68", + "metadata": { + "id": "a5dc2b68" + }, + "source": [ + "### 2.3 - Install dependencies if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1fcec577", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "1fcec577", + "outputId": "93aa2df3-0cf5-4b04-84bb-6803bbf46df6" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + " ! 
pip install --default-timeout=100 \\\n", + " data-prep-toolkit[ray]==0.2.2.dev1 \\\n", + " data-prep-toolkit-transforms[ray,all]==0.2.2.dev1 \\\n", + " deepsearch-toolkit" + ] + }, + { + "cell_type": "markdown", + "id": "243322b8", + "metadata": { + "id": "243322b8" + }, + "source": [ + "### 2.4 - Restart Runtime\n", + "\n", + "After installing the dependencies, be sure to restart the runtime so that the newly installed libraries are loaded.\n", + "\n", + "You can do this by going to **`Runtime --> Restart Session`**\n", + "\n", + "Then you can continue to the next step (no need to re-run the notebook)." + ] + }, + { + "cell_type": "markdown", + "id": "e8b10be1", + "metadata": { + "id": "e8b10be1" + }, + "source": [ + "## Step-2: Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "356c66f7", + "metadata": { + "id": "356c66f7" + }, + "source": [ + "### 2.1 - Basic Config" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "e4YMZrBuFycl", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e4YMZrBuFycl", + "outputId": "8a316776-582c-4d01-80de-cd530081a080" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "33345487", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "33345487", + "outputId": "47dca359-2740-493d-83eb-1291617d3db1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "MY_CONFIG.RAY_RUNTIME_WORKERS: 2\n", + "MY_CONFIG.RAY_NUM_CPUS: 1\n", + "MY_CONFIG.RAY_MEMORY_GB: 2\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "## Configuration\n", + "class MyConfig:\n", + " pass\n", + "\n", + "MY_CONFIG = MyConfig 
()\n", + "\n", + "if RUNNING_IN_COLAB:\n", + " MY_CONFIG.INPUT_DATA_DIR = 'input'\n", + "else:\n", + " MY_CONFIG.INPUT_DATA_DIR = os.path.join (os.path.abspath (''), '..', 'data', 'solar-system')\n", + "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", + "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", + "\n", + "## Embedding model\n", + "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'\n", + "\n", + "## RAY CONFIGURATION\n", + "### For local runs, we can use more parallelism\n", + "### For google colab, be conservative\n", + "\n", + "if RUNNING_IN_COLAB:\n", + " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + " MY_CONFIG.RAY_NUM_CPUS = 0.3\n", + " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", + "else: # local run\n", + " num_cpus_available = os.cpu_count()\n", + " # print (num_cpus_available)\n", + " MY_CONFIG.RAY_NUM_CPUS = 1\n", + " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", + " # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", + " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + "\n", + "print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)\n", + "print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)\n", + "print ('MY_CONFIG.RAY_MEMORY_GB:', MY_CONFIG.RAY_MEMORY_GB)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "b15e6827", + "metadata": { + "id": "b15e6827" + }, + "outputs": [], + "source": [ + "## Add parent dir to path\n", + "import os,sys\n", + "\n", + "this_dir = os.path.abspath('')\n", + "parent_dir = os.path.dirname(this_dir)\n", + "sys.path.append (os.path.abspath (parent_dir))" + ] + }, + { + "cell_type": "markdown", + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", + "metadata": { + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" + }, + "source": [ + "### 2.2 - Set up input/output directories" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + 
"id": "60ac8bee-0960-4309-b225-d7a211b14262", + "outputId": "704d5f45-5d49-43b0-afeb-1dddf2aa326d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared output directory\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "\n", + "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", + " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", + "\n", + "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", + "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", + "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", + "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", + "output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_fuzzy_dedupe_out')\n", + "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_embeddings_out')\n", + "\n", + "## clear output folder\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", + "shutil.os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", + "\n", + "print (\"✅ Cleared output directory\")" + ] + }, + { + "cell_type": "markdown", + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", + "metadata": { + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" + }, + "source": [ + "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", + "\n", + "This step is reading the input folder containing all PDF files and ingest them in a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", + "The documents are converted into a JSON format which allows to easily chunk it in the later steps.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", + "metadata": { + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a" + }, + "source": [ + "### 3.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + 
"execution_count": 8, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "outputId": "5ef25857-46d4-463e-f847-369d18cb2d8d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-1: Processing input='/home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system' --> output='output/01_parquet_out'\n" + ] + } + ], + "source": [ + "STAGE = 1\n", + "\n", + "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", + "output_folder = output_parquet_dir\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", + "metadata": { + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" + }, + "source": [ + "### 3.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "outputId": "7a069b9a-1159-4993-d2b0-b26b16235f6b" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:49:32 INFO - Running locally\n", + "18:49:32 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "18:49:32 INFO - data factory data_ is using local data access: input_folder - /home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system output_folder - output/01_parquet_out\n", + "18:49:32 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:49:32 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "18:49:32 INFO - pipeline id 
pipeline_id\n", + "18:49:32 INFO - code location None\n", + "18:49:32 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", + "18:49:32 INFO - actor creation delay 0\n", + "18:49:32 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:49:33,959\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - orchestrator started at 2024-09-18 18:49:37\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.135861206799746, 'object_store': 4.06793060246855}\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(RayTransformFileProcessor pid=1212179)\u001b[0m 18:49:40 INFO - Initializing models\n", + "Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 167772.16it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=1212180)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:46 INFO - Completed processing 2 files in 0.14 min\n", + "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:46 INFO - done flushing in 0.001 sec\n", + "\u001b[36m(RayTransformFileProcessor pid=1212180)\u001b[0m 18:49:40 INFO - Initializing models\n", + "18:49:56 INFO - Completed execution in 0.4 min, execution result 0\n", + "Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 38031.25it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=1212179)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:1 completed successfully\n", + "CPU times: user 4.1 s, sys: 1.17 s, total: 5.27 s\n", + "Wall time: 28.2 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import ast\n", + "import os\n", + "import sys\n", + "\n", + "from pdf2parquet_transform import (\n", + " pdf2parquet_contents_type_cli_param,\n", + " pdf2parquet_contents_types,\n", + ")\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration\n", + "from pdf2parquet_transform_ray import Pdf2ParquetRayTransformConfiguration\n", + "\n", + "from data_processing.utils import GB, ParamsUtils\n", + "\n", + "\n", + "# create parameters\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS, \"memory\": MY_CONFIG.RAY_MEMORY_GB * GB}\n", + "ingest_config = {\n", + " pdf2parquet_contents_type_cli_param: pdf2parquet_contents_types.JSON,\n", + "}\n", + "\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " \"data_files_to_use\": ast.literal_eval(\"['.pdf']\"),\n", + " # orchestrator\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + "}\n", + "\n", + "\n", + "sys.argv = ParamsUtils.dict_to_req(d=(params | ingest_config))\n", + "# create launcher\n", + "launcher = RayTransformLauncher(Pdf2ParquetRayTransformConfiguration())\n", + "# launcher = PythonTransformLauncher(Pdf2ParquetPythonTransformConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "5ca790e0", + "metadata": { + "id": "5ca790e0" + }, + "source": [ + "### 3.3 - Inspect Generated output\n", + "\n", + "Here we should see one entry per input file processed." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "fe59563d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 254 + }, + "id": "fe59563d", + "outputId": "9ba799f3-a183-4467-d50f-44dbbc86d19a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Output dimensions (rows x columns)= (2, 12)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filename
0mars.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdf
1earth.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdf
\n", + "
" + ], + "text/plain": [ + " filename contents num_pages \\\n", + "0 mars.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "\n", + " num_tables num_doc_elements document_id ext \\\n", + "0 0 11 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 0 11 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:45.937701 1.966178 earth.pdf " + ], + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(5)\n", + "\n", + "## To display certain columns\n", + "#output_df[['column1', 'column2', 'column3']].head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "e5058a21", + "metadata": { + "id": "e5058a21" + }, + "source": [ + "\n", + "### 3.4 - Understand the output\n", + "\n", + "Here are some interesting attributes to note:\n", + "\n", + "- **filename** : original filename\n", + "- **contents** : document text and layout, serialized as JSON\n", + "- **document_id**: unique id (UUID) assigned to this document\n", + "- **hash** : hash of document\n", + "- **pdf_convert_time** : time to convert this pdf in seconds\n", + "\n", + "Let's inspect the **contents** column. See how the text is being divided up!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "f870e624", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f870e624", + "outputId": "e759dddf-64ac-4b55-a9bf-d0722620d6ab" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',\n", + " 'filename': 'mars.pdf',\n", + " 'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.35137939,\n", + " 654.45184326,\n", + " 169.88169861,\n", + " 667.98492432],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.09541321,\n", + " 630.68127441,\n", + " 210.66503906,\n", + " 642.34405518],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.84518433,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.02520752],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. 
At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.18510437,\n", + " 570.83258057,\n", + " 374.99838257,\n", + " 581.07043457],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about the Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.22866821,\n", + " 542.98168945,\n", + " 163.86282349,\n", + " 554.45288086],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87440491,\n", + " 500.84011841,\n", + " 477.48345947,\n", + " 534.55810547],\n", + " 'page': 1,\n", + " 'span': [0, 196]}],\n", + " 'text': 'Mars, the fourth planet from the Sun, is a cold, '\n", + " 'desert world with a thin atmosphere composed '\n", + " 'primarily of carbon dioxide. Its reddish hue comes '\n", + " 'from iron oxide, or rust, prevalent on its surface.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.2026062,\n", + " 482.90710449,\n", + " 237.04431152,\n", + " 493.07443237],\n", + " 'page': 1,\n", + " 'span': [0, 23]}],\n", + " 'text': 'Basic facts about Mars:',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 453.019104,\n", + " 477.48171997,\n", + " 474.9703064],\n", + " 'page': 1,\n", + " 'span': [0, 78]}],\n", + " 'text': '· Distance from the Sun: Average of 228 million '\n", + " 'kilometers (142 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.79351807,\n", + " 431.73287964,\n", + " 451.2142334],\n", + " 'page': 1,\n", + " 'span': [0, 64]}],\n", + " 'text': '· Rotation Period: 24.6 hours (one Martian day - '\n", + " 'called a \"sol\")',\n", + " 'type': 'paragraph'},\n", + " 
{'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 429.10913086,\n", + " 365.9559021,\n", + " 438.83737183],\n", + " 'page': 1,\n", + " 'span': [0, 44]}],\n", + " 'text': '· Moons: Two small moons, Phobos and Deimos.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.51646423],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "import pprint\n", + "import json\n", + "\n", + "pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))\n", + "# json.loads(output_df.iloc[0, ]['contents'])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "e1a10c2d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e1a10c2d", + "outputId": "d9eab8cc-79ac-4f5e-99f3-596e357a2e39" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',\n", + " 'filename': 'earth.pdf',\n", + " 'page-hashes': [{'hash': 'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.30961609,\n", + " 654.45184326,\n", + " 174.04208374,\n", + " 667.93347168],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.12528992,\n", + " 
630.69073486,\n", + " 210.66503906,\n", + " 642.27935791],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87112427,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.04595947],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.20942688,\n", + " 570.81555176,\n", + " 375.57919312,\n", + " 581.08459473],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about our Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.15542603,\n", + " 542.98168945,\n", + " 167.32983398,\n", + " 554.36669922],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.91053772,\n", + " 512.46295166,\n", + " 477.84887695,\n", + " 534.48431396],\n", + " 'page': 1,\n", + " 'span': [0, 107]}],\n", + " 'text': \"Earth is the third planet from the Sun. It's our home \"\n", + " 'planet. 
Earth is the only place we know of with life.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.30151367,\n", + " 494.86206055,\n", + " 240.17156982,\n", + " 505.07229614],\n", + " 'page': 1,\n", + " 'span': [0, 24]}],\n", + " 'text': 'Basic facts about Earth:',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 464.97409058,\n", + " 477.47979736,\n", + " 487.02810669],\n", + " 'page': 1,\n", + " 'span': [0, 79]}],\n", + " 'text': '· Distance from the Sun: Average of 149.6 million '\n", + " 'kilometers (93 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 452.86901855,\n", + " 317.90722656,\n", + " 463.24041748],\n", + " 'page': 1,\n", + " 'span': [0, 37]}],\n", + " 'text': '· Rotation Period: 24 hours (one day)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.71496582,\n", + " 396.66357422,\n", + " 451.19915771],\n", + " 'page': 1,\n", + " 'span': [0, 52]}],\n", + " 'text': '· Moons: One moon, called Luna or simply \"the Moon\".',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.53633118],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))" + ] + }, + { + "cell_type": "markdown", + "id": "72274586", + "metadata": { + "id": "72274586" + }, + "source": [ + "## Step-4: Doc chunks\n", + "\n", + "In the previous step, we extracted text from our PDFs. 
However, the content of each file is still stored as a single row in our parquet output.\n", + "\n", + "In this step, we split the documents into chunks according to their layout segmentation.\n", + "\n", + "This transform uses the [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`\n", + "to chunk according to the document layout segmentation, i.e. respecting the original document components such as paragraphs, tables, enumerations, etc.\n", + "It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: \"application/json\"`,\n", + "which provides the required JSON structure." + ] + }, + { + "cell_type": "markdown", + "id": "96198fa6", + "metadata": { + "id": "96198fa6" + }, + "source": [ + "### 4.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "305f00a3", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "305f00a3", + "outputId": "d680cc28-2d3a-4793-9373-c56635a308c9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" + ] + } + ], + "source": [ + "STAGE = 2\n", + "\n", + "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_chunk_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "369f2cd1", + "metadata": { + "id": "369f2cd1" + }, + "source": [ + "### 4.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "5b7b18d5", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": 
"5b7b18d5", + "outputId": "7151d997-74f1-42fd-90a2-0124c6a68c84" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:49:58 INFO - Running locally\n", + "18:49:58 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'output_chunk_column_name': 'contents', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", + "18:49:58 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "18:49:58 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:49:58 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:49:58 INFO - pipeline id pipeline_id\n", + "18:49:58 INFO - code location None\n", + "18:49:58 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "18:49:58 INFO - actor creation delay 0\n", + "18:49:58 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:50:00,178\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - orchestrator started at 2024-09-18 18:50:02\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.085193634033203, 'object_store': 4.042596817016602}\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:04 INFO - Completed processing 2 files in 0.033 min\n", + "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:04 INFO - done flushing in 0.001 sec\n", + "18:50:14 INFO - Completed execution in 0.271 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:2 completed successfully\n", + "CPU times: user 917 ms, sys: 285 ms, total: 1.2 s\n", + "Wall time: 18.6 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "from doc_chunk_transform_ray import DocChunkRayTransformConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " # doc_chunk arguments\n", + " # ...\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = RayTransformLauncher(DocChunkRayTransformConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "213afdf6", + "metadata": { + "id": "213afdf6" + }, + "source": [ + "### 4.3 - Inspect Generated output\n", + "\n", + "We should see that the documents are split into many chunks." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "d8138d43", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 893 + }, + "id": "d8138d43", + "outputId": "3cbc98f8-1dcb-4a32-9259-f801a83cf241" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Files processed : 2\n", + "Chunks created : 8\n", + "Input data dimensions (rows x columns)= (2, 12)\n", + "Output data dimensions (rows x columns)= (8, 15)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filenamecontentsdoc_jsonpathpage_numberbbox
0mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...
1mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...
2mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...
3mars.pdf1011528221ef-005b-4df1-a057-84a012239ed0pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-09-18T18:49:46.0098302.004444mars.pdfBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....
4earth.pdf1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdfSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...
5earth.pdf1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdfSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...
6earth.pdf1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdfEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...
7earth.pdf1011973d284f-30a5-464b-bfb9-28dacd2832f5pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-09-18T18:49:45.9377011.966178earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...
\n", + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "7 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "7 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "7 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... " + ], + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (f\"Files processed : {input_df.shape[0]:,}\")\n", + "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "9e9ca75c", + "metadata": { + "id": "9e9ca75c" + }, + "source": [ + "### 4.4 - Understanding the Output\n", + "\n", + "Here we see that the 2 PDF files are split into 8 chunks. The documents are split along 'natural boundaries' - paragraphs and bullet points.\n", + "\n", + "See how **document_id** is carried throughout. 
This helps us identify original documents.\n", + "\n", + "Also note **contents** is now plain text (not JSON as before)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "3090c950", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "3090c950", + "outputId": "fa82f54b-53a3-4447-a4ca-2fe92dea452a" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", + "1 mars.pdf Solar System\\nFor more details about the Solar...\n", + "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "4 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "5 earth.pdf Solar System\\nFor more details about our Solar...\n", + "6 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "7 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "d5f151ae", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d5f151ae", + "outputId": "87a8d7a0-0bc0-4735-9edb-57e9c9e5a8e1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. 
Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 3------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "20217298", + "metadata": {}, + "source": [ + "## Step-5: DOC ID generation\n", + "\n", + "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", + "\n", + " - Adding document hash: this enables the addition of a document hash-based id to the data. 
The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where you want to store it.\n", + " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_column** to the name of the column where you want to store it.\n", + "\n", + "**This is a prerequisite for fuzzy dedup** in the pipeline." + ] + }, + { + "cell_type": "markdown", + "id": "66811f5b", + "metadata": {}, + "source": [ + "### 5.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "1f747c0d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" + ] + } + ], + "source": [ + "\n", + "# Input for this stage is the output of the chunking component\n", + "# The output of this component makes it possible for the fuzzy dedup (fdedup) component to run on the data.\n", + "\n", + "STAGE = 3\n", + "\n", + "input_folder = output_chunk_dir\n", + "output_folder = output_docid_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "18aa0fe1", + "metadata": {}, + "source": [ + "### 5.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "f6e9e145", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:50:16 INFO - Running locally\n", + "18:50:16 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "18:50:16 INFO - data factory data_ is using local data access: input_folder - 
output/02_chunk_out output_folder - output/03_docid_out\n", + "18:50:16 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:50:16 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:50:16 INFO - pipeline id pipeline_id\n", + "18:50:16 INFO - code location None\n", + "18:50:16 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "18:50:16 INFO - actor creation delay 0\n", + "18:50:16 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:50:17,977\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - orchestrator started at 2024-09-18 18:50:19\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Number of files is 2, source profile {'max_file_size': 0.008135795593261719, 'min_file_size': 0.008058547973632812, 'total_file_size': 0.01619434356689453}\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.074102020822465, 'object_store': 4.037051009945571}\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Completed processing 2 files in 0.013 min\n", + "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - done flushing in 0.001 sec\n", + "18:50:29 INFO - Completed execution in 0.231 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:3 completed successfully\n", + "CPU times: user 107 ms, sys: 137 ms, total: 244 ms\n", + "Wall time: 15.1 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "from doc_id_transform_ray import DocIDRayTransformRuntimeConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " # doc id configuration\n", + " \"doc_id_doc_column\": \"contents\",\n", + " \"doc_id_hash_column\": \"chunk_hash\",\n", + " \"doc_id_int_column\": \"chunk_id\",\n", + "}\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "\n", + "launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "4954402f", + "metadata": {}, + "source": [ + "### 5.3 - Inspect Generated output\n", + "\n", + "You will notice we have two extra columns:\n", + "\n", + 
"- **chunk_hash** (set via `doc_id_hash_column`)\n", + "- **chunk_id** (set via `doc_id_int_column`)\n", + "\n", + "But still the same number of rows as before." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "1911179a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 15)\n", + "Output data dimensions (rows x columns)= (8, 17)\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "7 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "7 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "7 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " chunk_hash chunk_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 1 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 2 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 3 \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 5 \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 6 \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
7 " + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "852829dc", + "metadata": {}, + "source": [ + "## Step-6: Exact Dedup\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", + "metadata": { + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe" + }, + "source": [ + "### 6.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "4c7a1b94", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4c7a1b94", + "outputId": "7998935d-3f72-4617-ea03-fd2a40ad9f23" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" + ] + } + ], + "source": [ + "STAGE = 4\n", + "\n", + "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_exact_dedupe_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", + "metadata": { + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e" + }, + "source": [ + "### 6.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + 
"outputId": "aa460fea-a393-47d3-b084-59d47f26f0a7" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:50:31 INFO - Running locally\n", + "18:50:31 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", + "18:50:31 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "18:50:31 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:50:31 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:50:31 INFO - pipeline id pipeline_id\n", + "18:50:31 INFO - code location None\n", + "18:50:31 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "18:50:31 INFO - actor creation delay 0\n", + "18:50:31 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:50:33,176\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - orchestrator started at 2024-09-18 18:50:34\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.064273834228516, 'object_store': 4.032136917114258}\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:35 INFO - Completed processing 2 files in 0.014 min\n", + "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:35 INFO - done flushing in 0.001 sec\n", + "18:50:45 INFO - Completed execution in 0.23 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:4 completed successfully\n", + "CPU times: user 99.9 ms, sys: 168 ms, total: 268 ms\n", + "Wall time: 15.1 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "from ededup_transform_ray import EdedupRayTransformRuntimeConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " # ededup parameters\n", + " \"ededup_hash_cpu\": 0.5,\n", + " \"ededup_num_hashes\": 2,\n", + " \"ededup_doc_column\": \"contents\",\n", + " \"ededup_doc_id_column\": \"chunk_hash\",\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "eaf1c3c3", + "metadata": { + "id": "eaf1c3c3" + }, + "source": [ + "### 6.3 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d824ebf6", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 358 + }, + "id": "d824ebf6", + "outputId": "89f1013d-6dcf-418f-a0d7-5f78b19b74ac" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 17)\n", + "Output data dimensions (rows x columns)= (7, 18)\n", + "Input chunks before exact dedupe : 8\n", + "Output chunks after exact dedupe : 7\n", + "Duplicate chunks removed : 1\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "6 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nFor more details about our Solar... 
$.main-text[3] \n", + "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " chunk_hash chunk_id \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 1 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 2 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 3 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 5 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 6 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 7 \n", + "\n", + " removed \n", + "0 [] \n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... 
\n", + "5 [] \n", + "6 [] " + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "82cc9bb0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "82cc9bb0", + "outputId": "293489a5-a840-4d5c-fafd-245db30d81c0" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>contents</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>0</th>\n", + " <td>mars.pdf</td>\n", + " <td>Solar System\\nOur solar system is a vast and f...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1</th>\n", + " <td>mars.pdf</td>\n", + " <td>Solar System\\nFor more details about the Solar...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>2</th>\n", + " <td>mars.pdf</td>\n", + " <td>Mars\\nMars, the fourth planet from the Sun, is...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>3</th>\n", + " <td>mars.pdf</td>\n", + " <td>Basic facts about Mars:\\n· Distance from the S...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>4</th>\n", + " <td>earth.pdf</td>\n", + " <td>Solar System\\nFor more details about our Solar...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5</th>\n", + " <td>earth.pdf</td>\n", + " <td>Earth\\nEarth is the third planet from the Sun....</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>6</th>\n", + " <td>earth.pdf</td>\n", + " <td>Earth\\nBasic facts about Earth:\\n· Distance fr...</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", + "1 mars.pdf Solar System\\nFor more details about the Solar...\n", + "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "4 earth.pdf Solar System\\nFor more details about our Solar...\n", + "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "6 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "cc61dffa", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cc61dffa", + "outputId": "cf6393e6-c4c7-4606-87e5-892c26b28801" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. 
Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 3------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 1------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + "    print ('==========' , f, '===========')\n", + "    chunks = output_df[output_df['filename'] == f]['contents']\n", + "    for idx , chunk in enumerate(chunks):\n", + "        print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "383f40ba", + "metadata": { + "id": "383f40ba" + }, + "source": [ + "### 6.4 - Understanding the output\n", + "\n", + "Remember we had 8 chunks initially. Now we have 7! One duplicate chunk is removed.\n", + "\n", + "If you look at the PDFs, the following paragraph, common to `earth.pdf` and `mars.pdf`, is removed from one of the documents (here, `earth.pdf`)! Pretty neat, eh!\n", + "\n", + "```text\n", + "## Solar System\n", + "\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "85309751-8556-41c6-ac32-84acc941bc8d", + "metadata": { + "id": "85309751-8556-41c6-ac32-84acc941bc8d" + }, + "source": [ + "## Step-7: Fuzzy Dedup\n", + "\n", + "After exact deduplication, fuzzy deduplication is applied with the goal of removing chunks that have only **slight variations**, thereby reducing bias\n", + "in the data further.\n", + "\n", + "Small variations are quite common. In code data, for example, they appear as different variable values, added logging statements, etc." + ] + }, + { + "cell_type": "markdown", + "id": "fcf574a3-b287-419c-9c86-07b828b41ca6", + "metadata": { + "id": "fcf574a3-b287-419c-9c86-07b828b41ca6" + }, + "source": [ + "### 7.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "outputId": "4548fff6-f86f-45d4-a812-49aa061fdef2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-5: Processing input='output/03_docid_out' --> output='output/05_fuzzy_dedupe_out'\n" + ] + } + ], + "source": [ + "## Input to this component is the output of the doc_id generator component.\n", + "\n", + "STAGE = 5\n", + "\n", + "input_folder = output_docid_dir # fuzzy dedupe reads the doc_id output (stage 3), not the exact dedupe output\n", + "output_folder = output_fuzzy_dedupe_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3", + "metadata": { + "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3" + }, + "source": [ + "### 7.2 - Execute" + ] + }, + { + "cell_type": "code", + 
"execution_count": 27, + "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "outputId": "1164345a-93db-4f8e-ad34-58a1c3d0c116" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:50:46 INFO - Running locally\n", + "18:50:46 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 1}}\n", + "18:50:46 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n", + "18:50:46 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:50:46 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:50:46 INFO - pipeline id pipeline_id\n", + "18:50:46 INFO - code location None\n", + "18:50:46 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "18:50:46 INFO - actor creation delay 0\n", + "18:50:46 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:50:48,381\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - orchestrator started at 2024-09-18 18:50:49\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.067702485248446, 'object_store': 4.033851241692901}\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - starting run from the beginning\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - continuing from the very beginning\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Fuzzy: num buckets 8, bucket length 8\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 bucket actors\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 minhash actors\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Table preprocessing uses 1 readers\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 table processor actors\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:57 INFO - Completed 1 files in 0.131 min\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:57 INFO - Completed 1 files (50.0%) in 0.131 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:02 INFO - Completed processing 2 files in 0.215 min\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:02 INFO - creating minhash snapshots\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:03 INFO - minhash snapshots created\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:03 INFO - creating bucket snapshots\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - bucket snapshots created\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created 1 document actors\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created 1 bucket processor actors\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created bucket processor invoker\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - added invoker to bucket collectors\n", + "\u001b[36m(BucketsHash pid=1218636)\u001b[0m 18:51:04 INFO - processing buckets 0 long, 53 short\n", + "\u001b[36m(BucketsHash pid=1218636)\u001b[0m 18:51:04 INFO - Done submitting long buckets\n", + "\u001b[36m(BucketsHashProcessorInvoker pid=1219171)\u001b[0m 18:51:05 INFO - Waiting bucket processing completion. Submitted requests 1\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:05 INFO - Done processing buckets in 0.011 min\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:05 INFO - creating document snapshots\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:06 INFO - document snapshots created\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:06 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:12 INFO - Completed processing 2 files in 0.098 min\n", + "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:12 INFO - done flushing in 0.001 sec\n", + "18:51:22 INFO - Completed execution in 0.592 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:5 completed successfully\n", + "CPU times: user 174 ms, sys: 166 ms, total: 341 ms\n", + "Wall time: 36.7 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "import os\n", + "import sys\n", + "\n", + "from data_processing.utils import ParamsUtils\n", + "from fdedup_transform_ray import FdedupRayTransformConfiguration\n", + "from data_processing_ray.runtime.ray import RayTransformLauncher\n", + "\n", + "# create parameters\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # Orchestration parameters\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " # columns used\n", + " \"fdedup_doc_column\": \"contents\",\n", + " \"fdedup_id_column\": \"chunk_id\",\n", + " \"fdedup_cluster_column\": \"chunk_hash\",\n", + " # infrastructure\n", + " \"fdedup_bucket_cpu\": 0.3,\n", + " \"fdedup_doc_cpu\": 0.3,\n", + " \"fdedup_mhash_cpu\": 0.3,\n", + " \"fdedup_num_doc_actors\": 1,\n", + " \"fdedup_num_bucket_actors\": 1,\n", + " \"fdedup_num_minhash_actors\": 1,\n", + " \"fdedup_num_preprocessors\": 1,\n", + " # fuzzy parameters\n", + " \"fdedup_num_permutations\": 64,\n", + " \"fdedup_threshold\": 0.7, # (default 0.8)\n", + " \"fdedup_shingles_size\": 5,\n", + " \"fdedup_delimiters\": \" \"\n", + "}\n", + "\n", + "# Pass commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "\n", + "launcher = RayTransformLauncher(FdedupRayTransformConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "a6f8cd11", + "metadata": { + "id": "a6f8cd11" + }, + "source": [ + "### 7.3 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "e899ad60", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 222 + }, + "id": "e899ad60", + "outputId": "70d040ab-b1d5-4797-f725-11982ef82413" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 17)\n", + "Output data dimensions (rows x columns)= (6, 17)\n", + "Duplicate chunks removed by fuzzy-dedupe: 2\n" + ] + }, 
+ { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>num_pages</th>\n", + " <th>num_tables</th>\n", + " <th>num_doc_elements</th>\n", + " <th>document_id</th>\n", + " <th>ext</th>\n", + " <th>hash</th>\n", + " <th>size</th>\n", + " <th>date_acquired</th>\n", + " <th>pdf_convert_time</th>\n", + " <th>source_filename</th>\n", + " <th>contents</th>\n", + " <th>doc_jsonpath</th>\n", + " <th>page_number</th>\n", + " <th>bbox</th>\n", + " <th>chunk_id</th>\n", + " <th>chunk_hash</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>0</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Solar System\\nOur solar system is a vast and f...</td>\n", + " <td>$.main-text[2]</td>\n", + " <td>1</td>\n", + " <td>[132.84518433, 588.96014404, 479.40917969, 623...</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Solar System\\nFor more details about the Solar...</td>\n", + " <td>$.main-text[3]</td>\n", + " <td>1</td>\n", + " <td>[133.18510437, 570.83258057, 374.99838257, 581...</td>\n", + " <td>1</td>\n", + " <td>5</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>2</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Mars\\nMars, the fourth planet from the Sun, is...</td>\n", + " <td>$.main-text[5]</td>\n", + " <td>1</td>\n", + " <td>[132.87440491, 500.84011841, 477.48345947, 534...</td>\n", + " <td>2</td>\n", + " <td>-1</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>3</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Basic facts about Mars:\\n· Distance from the S...</td>\n", + " <td>$.main-text[6]</td>\n", + " <td>1</td>\n", + " <td>[133.2026062, 482.90710449, 237.04431152, 493....</td>\n", + " <td>3</td>\n", + " <td>-1</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>4</th>\n", + " <td>earth.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>973d284f-30a5-464b-bfb9-28dacd2832f5</td>\n", + " <td>pdf</td>\n", + " <td>18713f970989055625bef22209b6f4b6830b9ca22046bf...</td>\n", + " <td>2686</td>\n", + " <td>2024-09-18T18:49:45.937701</td>\n", + " <td>1.966178</td>\n", + " <td>earth.pdf</td>\n", + " <td>Earth\\nEarth is the third planet from the Sun....</td>\n", + " <td>$.main-text[5]</td>\n", + " <td>1</td>\n", + " <td>[132.91053772, 512.46295166, 477.84887695, 534...</td>\n", + " <td>6</td>\n", + " <td>-1</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5</th>\n", + " <td>earth.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>973d284f-30a5-464b-bfb9-28dacd2832f5</td>\n", + " <td>pdf</td>\n", + " <td>18713f970989055625bef22209b6f4b6830b9ca22046bf...</td>\n", + " <td>2686</td>\n", + " <td>2024-09-18T18:49:45.937701</td>\n", + " <td>1.966178</td>\n", + " <td>earth.pdf</td>\n", + " <td>Earth\\nBasic facts about Earth:\\n· Distance fr...</td>\n", + " <td>$.main-text[6]</td>\n", + " <td>1</td>\n", + " <td>[133.30151367, 494.86206055, 240.17156982, 505...</td>\n", + " <td>7</td>\n", + " <td>-1</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "5 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox chunk_id \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... 
0 \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... 1 \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... 2 \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... 3 \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 6 \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 7 \n", + "\n", + " chunk_hash \n", + "0 4 \n", + "1 5 \n", + "2 -1 \n", + "3 -1 \n", + "4 -1 \n", + "5 -1 " + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (\"Duplicate chunks removed by fuzzy-dedupe: \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "ab7ea52b", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 81 + }, + "id": "ab7ea52b", + "outputId": "13a1847a-bdd1-4dc9-a281-a8faac59c3a8" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>contents</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>0</th>\n", + " <td>mars.pdf</td>\n", + " <td>Solar System\\nOur solar system is a vast and f...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1</th>\n", + " <td>mars.pdf</td>\n", + " <td>Solar System\\nFor more details about the Solar...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>2</th>\n", + " <td>mars.pdf</td>\n", + " <td>Mars\\nMars, the fourth planet from the Sun, is...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>3</th>\n", + " <td>mars.pdf</td>\n", + " <td>Basic facts about Mars:\\n· Distance from the S...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>4</th>\n", + " <td>earth.pdf</td>\n", + " <td>Earth\\nEarth is the third planet from the Sun....</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5</th>\n", + " <td>earth.pdf</td>\n", + " <td>Earth\\nBasic facts about Earth:\\n· Distance fr...</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", + "1 mars.pdf Solar System\\nFor more details about the Solar...\n", + "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "4 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "5 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "6bdd3515", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6bdd3515", + "outputId": "5a214fa3-c420-42d7-dcab-574b661e0cd8" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. 
Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 3------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 1------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + "    print ('==========' , f, '===========')\n", + "    chunks = output_df[output_df['filename'] == f]['contents']\n", + "    for idx , chunk in enumerate(chunks):\n", + "        print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "2b34d9c6", + "metadata": { + "id": "2b34d9c6" + }, + "source": [ + "### 7.4 - Understanding the output\n", + "\n", + "We started with 8 rows (fuzzy dedupe reads the doc_id output, not the exact dedupe output) and ended up with 6. Fuzzy dedupe removed the exact duplicate chunk and the following **very similar** chunk.\n", + "\n", + "The two chunks are identical except for the words 'the' and 'our':\n", + "\n", + "**earth.pdf**\n", + "\n", + "`For more details about *our* Solar system see Chapter 1.`\n", + "\n", + "**mars.pdf**\n", + "\n", + "`For more details about *the* Solar system see Chapter 1.`\n", + "\n", + "Pretty neat, eh? 
👏\n", + "\n", + "### Configuring Fuzzy de-dupe\n", + "\n", + "You can tune fuzzy dedupe by adjusting the following parameters:\n", + "\n", + "```python\n", + "# fuzzy parameters\n", + " \"fdedup_num_permutations\": 64,\n", + " \"fdedup_threshold\": 0.7, # (default 0.8)\n", + " \"fdedup_shingles_size\": 5,\n", + " \"fdedup_delimiters\": \" \"\n", + "```\n", + "\n", + "In our case, we set the `fdedup_threshold` parameter to 0.7 (the default is 0.8).\n" + ] + }, + { + "cell_type": "markdown", + "id": "5370950a-2a3a-4143-8218-f9b4808099ba", + "metadata": { + "id": "5370950a-2a3a-4143-8218-f9b4808099ba" + }, + "source": [ + "## Step-8: Text encoding\n", + "\n", + "Encode the text chunks as vectors (embeddings) for vector storage." + ] + }, + { + "cell_type": "markdown", + "id": "85aba685", + "metadata": { + "id": "85aba685" + }, + "source": [ + "### 8.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "outputId": "1c7835d1-1f2c-4545-8533-d9ab7a3ad0aa" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-6: Processing input='output/05_fuzzy_dedupe_out' --> output='output/06_embeddings_out'\n" + ] + } + ], + "source": [ + "STAGE = 6\n", + "\n", + "input_folder = output_fuzzy_dedupe_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_embeddings_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "c97545f4", + "metadata": { + "id": "c97545f4" + }, + "source": [ + "### 8.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "metadata": { + "colab": { + "base_uri": 
"https://localhost:8080/" + }, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "outputId": "91dd893c-3056-4d2a-bffe-49645e584a12" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "18:51:23 INFO - Running locally\n", + "18:51:23 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "18:51:23 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", + "18:51:23 INFO - data factory data_ max_files -1, n_sample -1\n", + "18:51:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "18:51:23 INFO - pipeline id pipeline_id\n", + "18:51:23 INFO - code location None\n", + "18:51:23 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "18:51:23 INFO - actor creation delay 0\n", + "18:51:23 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", + "2024-09-18 18:51:25,784\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - orchestrator started at 2024-09-18 18:51:28\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Number of files is 2, source profile {'max_file_size': 0.008937835693359375, 'min_file_size': 0.00830841064453125, 'total_file_size': 0.017246246337890625}\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.01370926015079, 'object_store': 4.0068546291440725}\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:33 INFO - Completed processing 2 files in 0.084 min\n", + "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:34 INFO - done flushing in 0.001 sec\n", + "18:51:44 INFO - Completed execution in 0.334 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:6 completed successfully\n", + "CPU times: user 611 ms, sys: 194 ms, total: 805 ms\n", + "Wall time: 22.1 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from text_encoder_transform_ray import TextEncoderRayTransformConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", + "params = {\n", + " # where to run\n", + " \"run_locally\": True,\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # orchestrator\n", + " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", + " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", + " # text_encoder\n", + " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", + "}\n", + "\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "# create launcher\n", + "launcher = RayTransformLauncher(TextEncoderRayTransformConfiguration())\n", + "# Launch the ray actor(s) to process the input\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Ray job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "b734852c", + "metadata": { + "id": "b734852c" + }, + "source": [ + "### 8.3 - Inspect Generated output\n", + "\n", + "You will see a column called `embeddings` added at the end. This is the text content converted into vectors (embeddings). We used the model `sentence-transformers/all-MiniLM-L6-v2`." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "7b1c1d09", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 205 + }, + "id": "7b1c1d09", + "outputId": "9e695b9d-f196-4cb7-c56f-3789251e7860" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (6, 17)\n", + "Output data dimensions (rows x columns)= (6, 18)\n" + ] + }, + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>num_pages</th>\n", + " <th>num_tables</th>\n", + " <th>num_doc_elements</th>\n", + " <th>document_id</th>\n", + " <th>ext</th>\n", + " <th>hash</th>\n", + " <th>size</th>\n", + " <th>date_acquired</th>\n", + " <th>pdf_convert_time</th>\n", + " <th>source_filename</th>\n", + " <th>contents</th>\n", + " <th>doc_jsonpath</th>\n", + " <th>page_number</th>\n", + " <th>bbox</th>\n", + " <th>chunk_id</th>\n", + " <th>chunk_hash</th>\n", + " <th>embeddings</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>0</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Solar System\\nOur solar system is a vast and f...</td>\n", + " <td>$.main-text[2]</td>\n", + " <td>1</td>\n", + " <td>[132.84518433, 588.96014404, 479.40917969, 623...</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>[0.0077404897, -0.020559434, 0.026426662, 0.01...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Solar System\\nFor more details about the Solar...</td>\n", + " <td>$.main-text[3]</td>\n", + " <td>1</td>\n", + " <td>[133.18510437, 570.83258057, 374.99838257, 581...</td>\n", + " <td>1</td>\n", + " <td>5</td>\n", + " <td>[-0.051861413, 0.0035226392, 0.030617053, 0.04...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>2</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Mars\\nMars, the fourth planet from the Sun, is...</td>\n", + " <td>$.main-text[5]</td>\n", + " <td>1</td>\n", + " <td>[132.87440491, 500.84011841, 477.48345947, 534...</td>\n", + " <td>2</td>\n", + " <td>-1</td>\n", + " <td>[0.07728298, 0.024971062, -0.04318075, 0.05809...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>3</th>\n", + " <td>mars.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>528221ef-005b-4df1-a057-84a012239ed0</td>\n", + " <td>pdf</td>\n", + " <td>8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...</td>\n", + " <td>2800</td>\n", + " <td>2024-09-18T18:49:46.009830</td>\n", + " <td>2.004444</td>\n", + " <td>mars.pdf</td>\n", + " <td>Basic facts about Mars:\\n· Distance from the S...</td>\n", + " <td>$.main-text[6]</td>\n", + " <td>1</td>\n", + " <td>[133.2026062, 482.90710449, 237.04431152, 493....</td>\n", + " <td>3</td>\n", + " <td>-1</td>\n", + " <td>[0.1059802, 0.025460616, 0.02362733, 0.0390564...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>4</th>\n", + " <td>earth.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>973d284f-30a5-464b-bfb9-28dacd2832f5</td>\n", + " <td>pdf</td>\n", + " <td>18713f970989055625bef22209b6f4b6830b9ca22046bf...</td>\n", + " <td>2686</td>\n", + " <td>2024-09-18T18:49:45.937701</td>\n", + " <td>1.966178</td>\n", + " <td>earth.pdf</td>\n", + " <td>Earth\\nEarth is the third planet from the Sun....</td>\n", + " <td>$.main-text[5]</td>\n", + " <td>1</td>\n", + " <td>[132.91053772, 512.46295166, 477.84887695, 534...</td>\n", + " <td>6</td>\n", + " <td>-1</td>\n", + " <td>[0.0724358, -0.058001805, -0.01977186, -0.0243...</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5</th>\n", + " <td>earth.pdf</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>973d284f-30a5-464b-bfb9-28dacd2832f5</td>\n", + " <td>pdf</td>\n", + " <td>18713f970989055625bef22209b6f4b6830b9ca22046bf...</td>\n", + " <td>2686</td>\n", + " <td>2024-09-18T18:49:45.937701</td>\n", + " <td>1.966178</td>\n", + " <td>earth.pdf</td>\n", + " <td>Earth\\nBasic facts about Earth:\\n· Distance fr...</td>\n", + " <td>$.main-text[6]</td>\n", + " <td>1</td>\n", + " <td>[133.30151367, 494.86206055, 240.17156982, 505...</td>\n", + " <td>7</td>\n", + " <td>-1</td>\n", + " <td>[0.091821924, 0.015197907, 0.07716932, 0.01711...</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " filename num_pages num_tables num_doc_elements \\\n", + "0 mars.pdf 1 0 11 \n", + "1 mars.pdf 1 0 11 \n", + "2 mars.pdf 1 0 11 \n", + "3 mars.pdf 1 0 11 \n", + "4 earth.pdf 1 0 11 \n", + "5 earth.pdf 1 0 11 \n", + "\n", + " document_id ext \\\n", + "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", + "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", + "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "5 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox chunk_id \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... 
0 \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... 1 \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... 2 \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... 3 \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 6 \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 7 \n", + "\n", + " chunk_hash embeddings \n", + "0 4 [0.0077404897, -0.020559434, 0.026426662, 0.01... \n", + "1 5 [-0.051861413, 0.0035226392, 0.030617053, 0.04... \n", + "2 -1 [0.07728298, 0.024971062, -0.04318075, 0.05809... \n", + "3 -1 [0.1059802, 0.025460616, 0.02362733, 0.0390564... \n", + "4 -1 [0.0724358, -0.058001805, -0.01977186, -0.0243... \n", + "5 -1 [0.091821924, 0.015197907, 0.07716932, 0.01711... " + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "f5e12630-be6b-4188-a925-77117155617b", + "metadata": { + "id": "f5e12630-be6b-4188-a925-77117155617b" + }, + "source": [ + "## Step-9: Copy output to final output dir" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "outputId": "e6a04d78-b8e9-431a-e9f5-1f9ad1aee3a7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Copied output from 'output/06_embeddings_out' --> 'output/output_final'\n" + ] + } + ], + "source": [ + "import shutil\n", + "\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", + "shutil.copytree(src=output_folder, 
dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", + "\n", + "print (f\"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc0a6728", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw b/examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw new file mode 100644 index 000000000..c0525c556 --- /dev/null +++ b/examples/notebooks/intro/images/data-prep-kit-3-workflow.excalidraw @@ -0,0 +1,2832 @@ +{ + "type": "excalidraw", + "version": 2, + "source": "https://excalidraw.com", + "elements": [ + { + "type": "image", + "version": 128, + "versionNonce": 146671843, + "index": "b45", + "isDeleted": false, + "id": "nQdFTOsh8Rjwn3poFcnOO", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 258.1818181818182, + "y": 213.63636363636363, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 64, + "height": 64, + "seed": 222183398, + "groupIds": [ + "4aSnKsxGoqeqA7eYu4s2e" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726186954844, + "link": null, + "locked": false, + "status": "saved", + "fileId": "83ba3062a1490699e3ccc129acb25b1f4ec5534d", + "scale": [ + 1, + 1 + ] + }, + { + "type": "image", + "version": 240, + "versionNonce": 2054222979, + "index": "b46", + "isDeleted": false, + "id": 
"hlPJZs7lUbLYhuRbSmYHs", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 260.90909090909093, + "y": 285.4545454545455, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 64, + "height": 64, + "seed": 961787386, + "groupIds": [ + "4aSnKsxGoqeqA7eYu4s2e" + ], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "FVhCmDYbWjGck9rgcESwp", + "type": "arrow" + }, + { + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + } + ], + "updated": 1726186954844, + "link": null, + "locked": false, + "status": "saved", + "fileId": "83ba3062a1490699e3ccc129acb25b1f4ec5534d", + "scale": [ + 1, + 1 + ] + }, + { + "type": "arrow", + "version": 2550, + "versionNonce": 1240871476, + "index": "b47", + "isDeleted": false, + "id": "FVhCmDYbWjGck9rgcESwp", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 823.5583207607388, + "y": 273.73602641681657, + "strokeColor": "#2f9e44", + "backgroundColor": "transparent", + "width": 154.2895204048931, + "height": 2.3372664247598323, + "seed": 1954615226, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726708776348, + "link": null, + "locked": false, + "startBinding": { + "elementId": "Wxv71stEiYRpNjyhzzXgO", + "focus": 1.202109076005182, + "gap": 9.103775306193256, + "fixedPoint": null + }, + "endBinding": null, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + 154.2895204048931, + 2.3372664247598323 + ] + ] + }, + { + "type": "text", + "version": 324, + "versionNonce": 1281521869, + "index": "b4M", + "isDeleted": false, + "id": "zSJvmm-7DrsR5-qRb96Kl", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 595.4118679291607, + "y": 242.27481706603328, 
+ "strokeColor": "#1e1e1e", + "backgroundColor": "#ffc9c9", + "width": 141.51840079198635, + "height": 59.453152259008114, + "seed": 409665722, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + }, + { + "id": "0wYqjwjKHCGbx7CfmDR__", + "type": "arrow" + } + ], + "updated": 1726186894805, + "link": null, + "locked": false, + "fontSize": 23.781260903603247, + "fontFamily": 1, + "text": "2. split into\nchunks", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "2. split into\nchunks", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "arrow", + "version": 848, + "versionNonce": 138401069, + "index": "b4N", + "isDeleted": false, + "id": "JMprrs8mNVD4CrqUlVm7i", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 329.1268602850381, + "y": 278.24885892455757, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", + "width": 185.2530890548909, + "height": 2.823455039174007, + "seed": 1319994682, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726186962183, + "link": null, + "locked": false, + "startBinding": { + "elementId": "hlPJZs7lUbLYhuRbSmYHs", + "focus": -1.189794049219074, + "gap": 7.205686529987929, + "fixedPoint": null + }, + "endBinding": { + "elementId": "YFlD_rDw6IwCctPG9BjYf", + "focus": 1.1403432588201572, + "gap": 6.460959750980123, + "fixedPoint": null + }, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + 185.2530890548909, + -2.823455039174007 + ] + ] + }, + { + "type": "text", + "version": 757, + "versionNonce": 361576332, + "index": "b4O", + "isDeleted": false, + "id": "G0k27V_VE7lyh7YGr_fts", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 
0, + "x": 1128.9917648038, + "y": 212.9780740734803, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 110.85037231445312, + "height": 58.225670034857664, + "seed": 970452474, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "FVhCmDYbWjGck9rgcESwp", + "type": "arrow" + } + ], + "updated": 1726708803406, + "link": null, + "locked": false, + "fontSize": 23.290268013943066, + "fontFamily": 1, + "text": "4. dedupe\n(exact)", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "4. dedupe\n(exact)", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 598, + "versionNonce": 1689279715, + "index": "b4g", + "isDeleted": false, + "id": "XUbC5cWQCm-GEFrdqZW7g", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 333.94038113680745, + "y": 243.15978750685963, + "strokeColor": "#1e1e1e", + "backgroundColor": "#ffc9c9", + "width": 173.54608154296875, + "height": 28.457738187179977, + "seed": 1458850132, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + } + ], + "updated": 1726187078639, + "link": null, + "locked": false, + "fontSize": 22.766190549743982, + "fontFamily": 1, + "text": "1. extract text", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "1. 
extract text", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "image", + "version": 145, + "versionNonce": 1461008621, + "index": "b4h", + "isDeleted": false, + "id": "XH-Rt0Q5-K2g4tM9reh76", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 520.8409090909091, + "y": 209.88636363636368, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 64, + "height": 64, + "seed": 1159948140, + "groupIds": [ + "KKvJ56bTHwzAbN8YXYU0-" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726186894805, + "link": null, + "locked": false, + "status": "saved", + "fileId": "fffa228d79e3bc7053142e0031890d5aaf369b8a", + "scale": [ + 1, + 1 + ] + }, + { + "type": "image", + "version": 193, + "versionNonce": 1127846733, + "index": "b4i", + "isDeleted": false, + "id": "YFlD_rDw6IwCctPG9BjYf", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 520.8409090909091, + "y": 279.8863636363637, + "strokeColor": "transparent", + "backgroundColor": "transparent", + "width": 64, + "height": 64, + "seed": 1369151980, + "groupIds": [ + "KKvJ56bTHwzAbN8YXYU0-" + ], + "frameId": null, + "roundness": null, + "boundElements": [ + { + "id": "0wYqjwjKHCGbx7CfmDR__", + "type": "arrow" + }, + { + "id": "JMprrs8mNVD4CrqUlVm7i", + "type": "arrow" + } + ], + "updated": 1726186894805, + "link": null, + "locked": false, + "status": "saved", + "fileId": "fffa228d79e3bc7053142e0031890d5aaf369b8a", + "scale": [ + 1, + 1 + ] + }, + { + "type": "arrow", + "version": 753, + "versionNonce": 1832909987, + "index": "b4j", + "isDeleted": false, + "id": "0wYqjwjKHCGbx7CfmDR__", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 587.6995151292258, + "y": 276.08728311464677, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", 
+ "width": 160.10395921482052, + "height": 0.6238794650969908, + "seed": 1397245780, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726186894829, + "link": null, + "locked": false, + "startBinding": { + "elementId": "YFlD_rDw6IwCctPG9BjYf", + "focus": -1.1101505124640194, + "gap": 3.799080521716917, + "fixedPoint": null + }, + "endBinding": { + "elementId": "zSJvmm-7DrsR5-qRb96Kl", + "focus": -0.1259939432648205, + "gap": 10.873205622899263, + "fixedPoint": null + }, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + 160.10395921482052, + -0.6238794650969908 + ] + ] + }, + { + "type": "text", + "version": 19, + "versionNonce": 1725165603, + "index": "b4t", + "isDeleted": false, + "id": "56KAsZE3Fub50OzL9XJ35", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 344.7055268721148, + "y": 290.01136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 137.6798553466797, + "height": 25, + "seed": 961622755, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726187031887, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "(pdf2parquet)", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "(pdf2parquet)", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 89, + "versionNonce": 1217800429, + "index": "b4u", + "isDeleted": false, + "id": "GEwyTqhl4LrSwcaOeKRT5", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 514.7055268721148, + "y": 356.01136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 74.97993469238281, + "height": 50, + "seed": 31755757, + "groupIds": [], + "frameId": null, 
+ "roundness": null, + "boundElements": [], + "updated": 1726187172155, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "parquet\nfiles", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "parquet\nfiles", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 273, + "versionNonce": 821721012, + "index": "b5F", + "isDeleted": false, + "id": "ZGkHBN9UBrJLYPIlm-KTj", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1355.555487199263, + "y": 305.51136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 118.5198974609375, + "height": 50, + "seed": 1591407981, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708923087, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "duplicate 'B'\nis removed", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "duplicate 'B'\nis removed", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 747, + "versionNonce": 104645940, + "index": "b5G", + "isDeleted": false, + "id": "DolT9H5aqzEugA7sUfNlx", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 827.643003983931, + "y": 226.3985286189349, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 166.41502380371094, + "height": 29.112835017428832, + "seed": 466678605, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708795102, + "link": null, + "locked": false, + "fontSize": 23.290268013943066, + "fontFamily": 1, + "text": "3. document id", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "3. 
document id", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "arrow", + "version": 1071, + "versionNonce": 474965812, + "index": "b5U", + "isDeleted": false, + "id": "cXhTkxU13WdQeAv3Z_1mR", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 1318.993474938044, + "y": 401.3233033689122, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", + "width": 0.8539592148204065, + "height": 113.62612053490295, + "seed": 605419139, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726709016812, + "link": null, + "locked": false, + "startBinding": null, + "endBinding": null, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + 0.8539592148204065, + 113.62612053490295 + ] + ] + }, + { + "type": "text", + "version": 976, + "versionNonce": 988237964, + "index": "b5V", + "isDeleted": false, + "id": "Ba_pxAykcwH_ZsTbAtduc", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 1218.815207047896, + "y": 429.9549461276493, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 184.07017517089844, + "height": 29.112835017428832, + "seed": 1665190893, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726709020882, + "link": null, + "locked": false, + "fontSize": 23.290268013943066, + "fontFamily": 1, + "text": "5. fuzzy dedupe", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5. 
fuzzy dedupe", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 580, + "versionNonce": 693951668, + "index": "b5h", + "isDeleted": false, + "id": "XFHbtP2KmiHNNjZhz8ajW", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1299.1022727272725, + "y": 517.40625, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 410701101, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "OdGsWefGyr6uqMl0wC6mH" + } + ], + "updated": 1726708989657, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 323, + "versionNonce": 1216816692, + "index": "b5i", + "isDeleted": false, + "id": "OdGsWefGyr6uqMl0wC6mH", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1315.9786418568, + "y": 522.40625, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 13.519989013671875, + "height": 25, + "seed": 593665933, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708989657, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "XFHbtP2KmiHNNjZhz8ajW", + "originalText": "A", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 573, + "versionNonce": 1856782260, + "index": "b5j", + "isDeleted": false, + "id": "NzWqph0M7tEkeTDKLPGZR", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1301.1931818181815, + "y": 564.5880681818182, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + 
"height": 35, + "seed": 2053187053, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "K1QK2dyVWiWfd32P8ovQK" + } + ], + "updated": 1726708989657, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 264, + "versionNonce": 334637364, + "index": "b5k", + "isDeleted": false, + "id": "K1QK2dyVWiWfd32P8ovQK", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1317.219552473588, + "y": 569.5880681818182, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 1350557773, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708989657, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "NzWqph0M7tEkeTDKLPGZR", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 680, + "versionNonce": 1002365620, + "index": "b5l", + "isDeleted": false, + "id": "Lf5-FqrnO7iDVhOKUtEnT", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1306.9204545454545, + "y": 619.3267045454547, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 999837357, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "cTJ-8HZCMcNbXqDHggxAH" + } + ], + "updated": 1726708989657, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 375, + "versionNonce": 213412916, + "index": "b5m", + "isDeleted": false, + "id": "cTJ-8HZCMcNbXqDHggxAH", + "fillStyle": "solid", + 
"strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1324.2668248956852, + "y": 624.3267045454547, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 12.579986572265625, + "height": 25, + "seed": 1515450637, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708989657, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "C", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Lf5-FqrnO7iDVhOKUtEnT", + "originalText": "C", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 141, + "versionNonce": 1757726132, + "index": "b5n", + "isDeleted": false, + "id": "LK6nmMo09HhGvAeViRfcK", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1274.397727272727, + "y": 523.3664772727274, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 975980397, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708989657, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "1", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "1", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 196, + "versionNonce": 761917108, + "index": "b5o", + "isDeleted": false, + "id": "LbPBuhQ2btuEnjbeSDvuK", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1278.397727272727, + "y": 569.6164772727275, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 2104152525, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + 
"boundElements": [], + "updated": 1726708993287, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "2", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "2", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 385, + "versionNonce": 800257204, + "index": "b5p", + "isDeleted": false, + "id": "tEnh5H4Dm1tA62FJY7ZnT", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1279.647727272727, + "y": 629.6164772727275, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 1129349773, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726709003336, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "5", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 307, + "versionNonce": 51819060, + "index": "b5q", + "isDeleted": false, + "id": "TExMhRi4612k0BcybcpHE", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1251.2855058149858, + "y": 678.5113636363637, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 143.59986877441406, + "height": 50, + "seed": 2082336653, + "groupIds": [ + "XhxUNIV4RRXanIHzjH6vP" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708989657, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "near duplicate \nA' is removed", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "near duplicate \nA' is removed", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "arrow", + "version": 1039, + 
"versionNonce": 199529869, + "index": "b5r", + "isDeleted": false, + "id": "KvvwHoDnDT0vBh2bOfiTz", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 1245.243474938044, + "y": 579.5733033689121, + "strokeColor": "#2f9e44", + "backgroundColor": "#b2f2bb", + "width": 192.8960407851796, + "height": 1.126120534903066, + "seed": 1004556899, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726188444758, + "link": null, + "locked": false, + "startBinding": null, + "endBinding": null, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], + [ + -192.8960407851796, + 1.126120534903066 + ] + ] + }, + { + "type": "text", + "version": 989, + "versionNonce": 923042467, + "index": "b5s", + "isDeleted": false, + "id": "cPSHqIr9Peb5h5TNxl3Bb", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 0, + "opacity": 100, + "angle": 0, + "x": 1100.5103669600053, + "y": 536.2049461276495, + "strokeColor": "#1e1e1e", + "backgroundColor": "#b2f2bb", + "width": 138.99639892578125, + "height": 29.112835017428832, + "seed": 865272429, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726188447614, + "link": null, + "locked": false, + "fontSize": 23.290268013943066, + "fontFamily": 1, + "text": "6. vectorize", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "6. 
vectorize", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "diamond", + "version": 103, + "versionNonce": 679668419, + "index": "b5vV", + "isDeleted": false, + "id": "tPvUjMUp7lW3F8V3H2MGV", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 960.0454545454546, + "y": 515.5113636363637, + "strokeColor": "#1e1e1e", + "backgroundColor": "#d0bfff", + "width": 63.75, + "height": 45, + "seed": 782762477, + "groupIds": [ + "CuM_sg3LC9KTYRVST18pX" + ], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726188516836, + "link": null, + "locked": false + }, + { + "type": "diamond", + "version": 117, + "versionNonce": 224511779, + "index": "b5w", + "isDeleted": false, + "id": "uOIVUAj_hGKNiZ3NnQm2n", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 961.9204545454546, + "y": 564.5113636363637, + "strokeColor": "#1e1e1e", + "backgroundColor": "#d0bfff", + "width": 63.75, + "height": 45, + "seed": 1245990083, + "groupIds": [ + "CuM_sg3LC9KTYRVST18pX" + ], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726188516836, + "link": null, + "locked": false + }, + { + "type": "diamond", + "version": 122, + "versionNonce": 1205596301, + "index": "b5x", + "isDeleted": false, + "id": "ylh6O0GmjhRAHndHyuEo2", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 966.9204545454546, + "y": 615.7613636363637, + "strokeColor": "#1e1e1e", + "backgroundColor": "#d0bfff", + "width": 63.75, + "height": 45, + "seed": 499397773, + "groupIds": [ + "CuM_sg3LC9KTYRVST18pX" + ], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726188516836, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 260, + "versionNonce": 1136192621, 
+ "index": "b5y", + "isDeleted": false, + "id": "ekXIjXxtZ6f2w_A-9CVUV", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 938.2855058149859, + "y": 670.7613636363637, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 107.5399169921875, + "height": 25, + "seed": 1616985635, + "groupIds": [], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726188507123, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "embeddings", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "embeddings", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 381, + "versionNonce": 1618061620, + "index": "b5z", + "isDeleted": false, + "id": "Uv-8TiLeECJuuNx1yJjtv", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 768.5454545454545, + "y": 280.72727272727275, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 637818278, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "id": "0wYqjwjKHCGbx7CfmDR__", + "type": "arrow" + }, + { + "type": "text", + "id": "B8Nj-HzRDl-FA-5UJ2hiw" + } + ], + "updated": 1726708776347, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 140, + "versionNonce": 1472181260, + "index": "b60", + "isDeleted": false, + "id": "B8Nj-HzRDl-FA-5UJ2hiw", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 783.2418233698064, + "y": 285.72727272727275, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 17.879989624023438, + "height": 25, + "seed": 1971906541, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + 
"frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A'", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Uv-8TiLeECJuuNx1yJjtv", + "originalText": "A'", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 391, + "versionNonce": 1280205492, + "index": "b61", + "isDeleted": false, + "id": "l7XMM15Xwzq5xmDF0QvyN", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 764.090909090909, + "y": 186.09090909090912, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1556091898, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "SZp9x_uNQ-65LQPMQ768C" + } + ], + "updated": 1726708776347, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 132, + "versionNonce": 809849484, + "index": "b62", + "isDeleted": false, + "id": "SZp9x_uNQ-65LQPMQ768C", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 780.9672782204367, + "y": 191.09090909090912, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 13.519989013671875, + "height": 25, + "seed": 912377443, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "l7XMM15Xwzq5xmDF0QvyN", + "originalText": "A", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 413, + "versionNonce": 1599597620, + "index": "b63", + "isDeleted": false, + 
"id": "Wxv71stEiYRpNjyhzzXgO", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 767.1818181818182, + "y": 234.27272727272725, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 775085434, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "id": "0wYqjwjKHCGbx7CfmDR__", + "type": "arrow" + }, + { + "id": "FVhCmDYbWjGck9rgcESwp", + "type": "arrow" + }, + { + "type": "text", + "id": "zyU1230-bmsHaQTSoi7Ov" + } + ], + "updated": 1726708776347, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 102, + "versionNonce": 1402151180, + "index": "b64", + "isDeleted": false, + "id": "zyU1230-bmsHaQTSoi7Ov", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 783.2081888372248, + "y": 239.27272727272725, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 1842733667, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Wxv71stEiYRpNjyhzzXgO", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 397, + "versionNonce": 997475764, + "index": "b65", + "isDeleted": false, + "id": "IkaeA2i4mlTdmulYEI_na", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 771.3636363636363, + "y": 325.3636363636364, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1839286010, + 
"groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "IgKDOIQhfqb_x9gQh30eh" + } + ], + "updated": 1726708776347, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 89, + "versionNonce": 421732236, + "index": "b66", + "isDeleted": false, + "id": "IgKDOIQhfqb_x9gQh30eh", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 787.3900070190429, + "y": 330.3636363636364, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 1893385699, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "IkaeA2i4mlTdmulYEI_na", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 440, + "versionNonce": 1439264564, + "index": "b67", + "isDeleted": false, + "id": "qGfihx9_lQSyc1F8oQTu0", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 772.909090909091, + "y": 369.01136363636374, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1381062179, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "0DIl-np94wHje4sIubFJp" + } + ], + "updated": 1726708776347, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 133, + "versionNonce": 1496272396, + "index": "b68", + "isDeleted": false, + "id": "0DIl-np94wHje4sIubFJp", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", 
+ "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 790.2554612593218, + "y": 374.01136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 12.579986572265625, + "height": 25, + "seed": 1722325443, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "C", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "qGfihx9_lQSyc1F8oQTu0", + "originalText": "C", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 70, + "versionNonce": 247294132, + "index": "b69", + "isDeleted": false, + "id": "lkM4ke2d8E4KSisX5yE08", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 762.5454545454546, + "y": 429.51136363636374, + "strokeColor": "#1e1e1e", + "backgroundColor": "#d0bfff", + "width": 64.55995178222656, + "height": 25, + "seed": 1905848653, + "groupIds": [ + "wECUsJGvuBUaz0aXhNgT4" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708776347, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "chunks", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "chunks", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 527, + "versionNonce": 1269467404, + "index": "b698", + "isDeleted": false, + "id": "JNHVvikjirDDllCKotbJC", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1025.9545454545455, + "y": 275.68750000000006, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 848769955, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + 
"boundElements": [ + { + "type": "text", + "id": "8Msc7tXcZdg2UUH2NmUn-" + } + ], + "updated": 1726708934863, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 287, + "versionNonce": 1779271564, + "index": "b69G", + "isDeleted": false, + "id": "8Msc7tXcZdg2UUH2NmUn-", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1040.6509142788973, + "y": 280.68750000000006, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 17.879989624023438, + "height": 25, + "seed": 1297532739, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A'", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "JNHVvikjirDDllCKotbJC", + "originalText": "A'", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 565, + "versionNonce": 1888269836, + "index": "b69O", + "isDeleted": false, + "id": "fkbHGW5tJ-Ay0sh8h-9hJ", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1022.5, + "y": 182.05113636363643, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 2116216547, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "BNiP4zX7PtFTn_e_5vXX3" + } + ], + "updated": 1726708934863, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 308, + "versionNonce": 1814172812, + "index": "b69V", + "isDeleted": false, + "id": "BNiP4zX7PtFTn_e_5vXX3", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1039.3763691295276, + "y": 
187.05113636363643, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 13.519989013671875, + "height": 25, + "seed": 1804210819, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "fkbHGW5tJ-Ay0sh8h-9hJ", + "originalText": "A", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 558, + "versionNonce": 981967628, + "index": "b69d", + "isDeleted": false, + "id": "QYKbNgibs7-HxaNNr8tfG", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1024.590909090909, + "y": 229.23295454545456, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1716177443, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "C-rwFmAbwI_qgVqpkXy7m" + } + ], + "updated": 1726708934863, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 249, + "versionNonce": 1916232076, + "index": "b69l", + "isDeleted": false, + "id": "C-rwFmAbwI_qgVqpkXy7m", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1040.6172797463155, + "y": 234.23295454545456, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 592678339, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": 
"QYKbNgibs7-HxaNNr8tfG", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 653, + "versionNonce": 1248546828, + "index": "b69t", + "isDeleted": false, + "id": "m2Wj9fp76PKCAhrulCmTa", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1027.318181818182, + "y": 365.97159090909105, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 901963107, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "MNgTOO1UYazXucNSjXZ_z" + } + ], + "updated": 1726708934863, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 348, + "versionNonce": 52260492, + "index": "b6A", + "isDeleted": false, + "id": "MNgTOO1UYazXucNSjXZ_z", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1044.6645521684127, + "y": 370.97159090909105, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 12.579986572265625, + "height": 25, + "seed": 1223112963, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "C", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "m2Wj9fp76PKCAhrulCmTa", + "originalText": "C", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 127, + "versionNonce": 1292352780, + "index": "b6AG", + "isDeleted": false, + "id": "J1KVE_C00rdGo7FWIwu1X", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 998.7954545454545, + "y": 188.01136363636374, + "strokeColor": "#e03131", + 
"backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 1442121325, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "1", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "1", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 181, + "versionNonce": 832846732, + "index": "b6AV", + "isDeleted": false, + "id": "TIEDsM4QhNNDJARAJnvDz", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1001.7954545454545, + "y": 234.26136363636374, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 846611715, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "2", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "2", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 229, + "versionNonce": 2066541068, + "index": "b6Al", + "isDeleted": false, + "id": "tGvqUuD_kCzfMYn-UX8o-", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1004.2954545454545, + "y": 283.01136363636374, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 758667053, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "3", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + 
"originalText": "3", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 360, + "versionNonce": 479971468, + "index": "b6B", + "isDeleted": false, + "id": "IQM8OVr381UGBDKQtda8U", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1004.0454545454545, + "y": 371.26136363636374, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 618433805, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "5", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 611, + "versionNonce": 430626572, + "index": "b6BV", + "isDeleted": false, + "id": "fJGd6Pf-SaTmbDMUGHhUW", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1028.3972327492456, + "y": 322.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1491526540, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "Ax-8fSsrXvrkMhlGAgJgO" + } + ], + "updated": 1726708934863, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 302, + "versionNonce": 1859392908, + "index": "b6C", + "isDeleted": false, + "id": "Ax-8fSsrXvrkMhlGAgJgO", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1044.423603404652, + "y": 327.2812500000001, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + 
"seed": 1943704076, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "fJGd6Pf-SaTmbDMUGHhUW", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 259, + "versionNonce": 2035385356, + "index": "b6CV", + "isDeleted": false, + "id": "07qZABiLS71UbigBsFpnK", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1002.0335963856091, + "y": 327.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 1965424820, + "groupIds": [ + "ssihZCwGeFNCQehvjAg06" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708934863, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "4", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "4", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "arrow", + "version": 2600, + "versionNonce": 1259679372, + "index": "b6D", + "isDeleted": false, + "id": "M_WCuesgPRdSQ_zqaUtz0", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1113.5321305627851, + "y": 279.97561555378826, + "strokeColor": "#2f9e44", + "backgroundColor": "transparent", + "width": 154.2895204048931, + "height": 2.3372664247598323, + "seed": 1489010356, + "groupIds": [], + "frameId": null, + "roundness": { + "type": 2 + }, + "boundElements": [], + "updated": 1726708895234, + "link": null, + "locked": false, + "startBinding": null, + "endBinding": null, + "lastCommittedPoint": null, + "startArrowhead": null, + "endArrowhead": "arrow", + "points": [ + [ + 0, + 0 + ], 
+ [ + 154.2895204048931, + 2.3372664247598323 + ] + ] + }, + { + "type": "text", + "version": 176, + "versionNonce": 14571020, + "index": "b6E", + "isDeleted": false, + "id": "wkavhEPwz2TNGwf8xFeLA", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1263.0335963856091, + "y": 188.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 809955212, + "groupIds": [ + "uHtPh4-PiLJtgc-p_Cdgo" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708942969, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "1", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "1", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 538, + "versionNonce": 1071049484, + "index": "b6F", + "isDeleted": false, + "id": "Qaz1byDgzm-0ZrVLBmU4v", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1288.9545454545455, + "y": 273.1875000000001, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 144156909, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "D2HbgzHXdGyxGppwaWbBy" + } + ], + "updated": 1726708966705, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 296, + "versionNonce": 2108300212, + "index": "b6G", + "isDeleted": false, + "id": "D2HbgzHXdGyxGppwaWbBy", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1303.6509142788973, + "y": 278.1875000000001, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 17.879989624023438, + 
"height": 25, + "seed": 2062418765, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A'", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Qaz1byDgzm-0ZrVLBmU4v", + "originalText": "A'", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 569, + "versionNonce": 509454732, + "index": "b6H", + "isDeleted": false, + "id": "-LxVJeZLqj0MgI5FEg_pm", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1281.5, + "y": 179.55113636363643, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1514803629, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "trFDjiJr6cfNlCSEKqNjE" + } + ], + "updated": 1726708966705, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 311, + "versionNonce": 1054115124, + "index": "b6I", + "isDeleted": false, + "id": "trFDjiJr6cfNlCSEKqNjE", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1298.3763691295276, + "y": 184.55113636363643, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 13.519989013671875, + "height": 25, + "seed": 1674925069, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "A", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "-LxVJeZLqj0MgI5FEg_pm", + "originalText": "A", + 
"autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 566, + "versionNonce": 713594892, + "index": "b6J", + "isDeleted": false, + "id": "Kxu9owye4gMpRvh7kJ1Nl", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1287.590909090909, + "y": 226.73295454545456, + "strokeColor": "#e03131", + "backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 1938377325, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "UP92rSYiIXnnBFhov6WNx" + } + ], + "updated": 1726708966705, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 256, + "versionNonce": 301317812, + "index": "b6K", + "isDeleted": false, + "id": "UP92rSYiIXnnBFhov6WNx", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1303.6172797463157, + "y": 231.73295454545456, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 15.219985961914062, + "height": 25, + "seed": 707753165, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "B", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "Kxu9owye4gMpRvh7kJ1Nl", + "originalText": "B", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "rectangle", + "version": 593, + "versionNonce": 5355148, + "index": "b6L", + "isDeleted": false, + "id": "KMOsOR4pOx-ute2ztnw1k", + "fillStyle": "solid", + "strokeWidth": 1, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1293.318181818182, + "y": 361.4715909090911, + "strokeColor": "#e03131", + 
"backgroundColor": "#ffc9c9", + "width": 47.27272727272725, + "height": 35, + "seed": 635317229, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": { + "type": 3 + }, + "boundElements": [ + { + "type": "text", + "id": "SsRO-f6mzQzf5jQOudz6C" + } + ], + "updated": 1726708966705, + "link": null, + "locked": false + }, + { + "type": "text", + "version": 287, + "versionNonce": 800311348, + "index": "b6M", + "isDeleted": false, + "id": "SsRO-f6mzQzf5jQOudz6C", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1310.6645521684127, + "y": 366.4715909090911, + "strokeColor": "#1e1e1e", + "backgroundColor": "transparent", + "width": 12.579986572265625, + "height": 25, + "seed": 1382819405, + "groupIds": [ + "bDrNCHlMlNcEbIn9yZXly", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 5, + "text": "C", + "textAlign": "center", + "verticalAlign": "middle", + "containerId": "KMOsOR4pOx-ute2ztnw1k", + "originalText": "C", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 206, + "versionNonce": 745735436, + "index": "b6N", + "isDeleted": false, + "id": "US1PK13ekocRlMvOrHSJL", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1265.0335963856091, + "y": 231.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 1525760780, + "groupIds": [ + "bQ__H1TgpJXskAm32UBLZ", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "2", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + 
"originalText": "2", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 241, + "versionNonce": 1274323380, + "index": "b6O", + "isDeleted": false, + "id": "NxUqy-MsYDga_9XDrU9l7", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1267.5335963856091, + "y": 277.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 12, + "height": 25, + "seed": 1116920372, + "groupIds": [ + "4mN8vM1PMjtKHfzWdqXES", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "3", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "3", + "autoResize": true, + "lineHeight": 1.25 + }, + { + "type": "text", + "version": 240, + "versionNonce": 342262668, + "index": "b6P", + "isDeleted": false, + "id": "lSEPKkiY8if2M9pDun8DS", + "fillStyle": "solid", + "strokeWidth": 2, + "strokeStyle": "solid", + "roughness": 1, + "opacity": 100, + "angle": 0, + "x": 1270.5335963856091, + "y": 370.2812500000001, + "strokeColor": "#e03131", + "backgroundColor": "transparent", + "width": 11, + "height": 25, + "seed": 932194828, + "groupIds": [ + "Z8bVLPerSCYHViV4Ld1Ed", + "XEHMHITFJTjudNYgVFCPu" + ], + "frameId": null, + "roundness": null, + "boundElements": [], + "updated": 1726708966705, + "link": null, + "locked": false, + "fontSize": 20, + "fontFamily": 8, + "text": "5", + "textAlign": "left", + "verticalAlign": "top", + "containerId": null, + "originalText": "5", + "autoResize": true, + "lineHeight": 1.25 + } + ], + "appState": { + "gridSize": 20, + "gridStep": 5, + "gridModeEnabled": false, + "viewBackgroundColor": "#ffffff" + }, + "files": { + "83ba3062a1490699e3ccc129acb25b1f4ec5534d": { + "mimeType": "image/png", + "id": "83ba3062a1490699e3ccc129acb25b1f4ec5534d", + "dataURL": 
"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAAXNSR0IArs4c6QAABd1JREFUeF7tm39sE2UYx79vu61b2cbKVjackBHX6TpNFIzxDzaBIBInEHQQgxoCqIygyIBoDGZSRRNRcIHAnPIjDpBfhgTkp6DRiIAYfyErsk62yToDY+vItv6+vebewmyX9u5Kr9263P3V3Pvc93neT5973ifv3RGEeTSPeUCXoOEWApgJwABgJAASpoyguTotzZNQaFyQtWv7Tjl1g2mFFbjVYJwP4CMAI6IZmGp4OjTjxrup0/5cZu32L6PpSzKAVkPRuxT0rWgGc1ubB5A0fhwopU5qt5fpd35+JFp+JQFozS98gRJSG60g+uveBsCfp4AdDufMrNptp6LhXxTAdaMx1eNBA4DsaAQQTNMfAIPQS3uos7tUv2PH93LHIAqgJb9wMSFks9yOhfT6A2C2FJ2U630ia+sn5+WMRRRAq8F4hAJPyulUTCsoAN9S0+7h6OTsLdUXxDSkjosCsBqMFgD5UgXlsAsFgCUCRZuGcBPTamrMcviSAqA92sueUBEMOklKrV41SrKrq69ECkEcQIGxAxS6SB2Fc71QBvjpXFV5SbFu66bmcLT728YzABBK/+a0mmJ9VdW/dwohrgGwSRNyKXFYckn6unU37gRC/APwVcaLKq2mWFdV1RkuhKEBwLc6/EIzUifp167tCgfCkAHAJq0ip+2Ge6aOXr7cIRXC0ALAZwJRfZdpLJhGli51SYEw5AD4CqPq2IicrBlk9WqvGIShCYAxUB3STSp5msyZwwlBGJQAiFaLxAJ+synCg/N+qj+wf1HcAYhw2v9fTmDLrTcL7l4NygxQAMhFQMkA5RZQaoBSBJVVQFkGlT5A6QQFCCidoFjTZR2AXWGxmCSPK52g0gkqnaDSCcaiE8z44H0kT54UUJuo2wWusQmOYyfQ88Uetm+tnV2G9NdXBNr19MBjscB54iQcB78C9Xj6xjPeMyF56tSQNc956ht0vinw0kqsiqBuw8dImRY60J5du3HTtAbDnp+L4ZWrQk7Ia2lAe/kScFdbmI1u/YdIeSr0k3nH0eOwLQsEGiA+EABcp8+As1qhHpUDTfEE/tEV+/evTZmG5MdK+gB4LtbBY2lAwpjRSBr3kM8OgLepGW3TZ4G6XAEAXGfPgfvnasD83BfrYN+7P/SqOBAAOpa8BudJ3+s8mVtqoCmZwH7z59XZI/sAdG3cjK6Nm9hY4n33YsRn1VBn+97C6aw0wb5nXwAA27KVcBw9JrkFYIYDCUCdk43M2m1IyMtjsbTPfwkJY/OCAuDHtXPKkLHGxGx5gDww/1vAeepbeK809gFw/3GhD3RIKgMBIFgwvTYbrk18HNpnZoUEwGeB/tABdrmnzoy2WbMFa0DP7r24+fY7whkxGAD0dnXB9moFXGfOBhRB/1uA3Qb3F0F/YB+bkPvCn7hR9mx8AuDT19vYDOp2w9vUBNcPP4LPAP7wXwX6A0hd9CLSV1QwO8ehw7CtfCO+a0Cw3AwKQK1GyvRSZJgqQVJS2GW2pRVwHP96aAPg2trQ22GDetQoqNLT+ni5zp1H+7wFbOn0L4JxtwqIZUCwcb5r7FxVCdrdzYbjBkDaK4uR9PB4FnTXhk1w//pb0OqcPGUyqwP+B3U44bVY4DhxklV//yO1/GVoHn2EnequroHrp5+Fq37/0VitAuFFFUNrBYCyIaJsiCgbIrHYEIlhWQvPlVIE5SiCBmPMX5cP728WtO7ItZgzhSzEnwwZjPW3vg+UMa6YSf2VazEXRgSg1VB0mIKWxixkWR3Rw7mWS9MjAtBiMJYToFrWuGIkRoHyuy3mmogAsM/mvLCAIidGccvl5nqSisvXX74s+Pa4aA3go2k1FM2loLvkiiw2OmRurqVut5gvSQB4EWuB0QSKSjHBQTFOYMqtN6+WEotkALxYS0HRPELp+
lh/RSZlIrds2gmlFXc1XNoh9ZqwAPCijXkPZiQmuRcSihkgKABln9SGrSM1QBE7CoJroKinBAc97qRtY5t+D+uzmf8A6hsfbisiXOQAAAAASUVORK5CYII=", + "created": 1711006482453, + "lastRetrieved": 1726708752969 + }, + "fffa228d79e3bc7053142e0031890d5aaf369b8a": { + "mimeType": "image/png", + "id": "fffa228d79e3bc7053142e0031890d5aaf369b8a", + "dataURL": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAYAAACqaXHeAAAAAXNSR0IArs4c6QAABGRJREFUeF7tm0tsE1cUhv8Zu3FiJ3HihJCkpbxKQI55ryq6aKXQFigPBYG6aQAJEbHgKZAQCxYIlQ2gQJFSEK2gRRUtKuUhNSAC2YDEgleIbdpCpEhGEEIgNsGOnYxn0L0G14HAzI3H44w9d+N5nHt8z3fPPef4eoZDljcuy+2HAYDVA7zetgUAdwjAh6x9hyN/+Lemqw27tn42nL5K+jB7gMfj9nEcPlKiXA2Z9Tt+hGvK2JRBYAbg9bolNQxTqoMAIG2ac8K1fTs3f6q0n1I53QBIFQRdAUgFBN0BUBuCLgGoCUG3ANSCoGsAakDQPYBkIWQEgGQgZAyA4UIY8QCUVnSv5ZxOF5NNTMLkS7QuhQ0AjAQMDzCWgBEDmOIak7ARBI0sYKTBIeuA/T+fRXvHQ8aElZz4pPGVWLdq0VtK0pIGsx5AcnOpbu+0eIC6JiSnzQBgVIJpqASzPghmPYDkwpa6vY0gaATBNARBdZ04OW3GEkjFEgjU1izgAc2eCknOB0hv7gHPod7258W/5XQp2hDprZ3rAyTNngqRG7TC+76CU80fy8kqBFCj6VMhcoNWer/gVLOsfbIC5Mt6aw0Ash4g8SaEps8GzOYhJ4gP9iLvrhuhqTPBCQPI87YNkutzToVkMsPqvo2+KS6I+QVDT7QgwNp6A5wYlXUETT0g/MlkPFm59r2Dsl84h8BXCwFJwujGfch56KPy/ZVj8HjtZoDjUNT0F/xfL6HH72qjjjYi9/6/IwuAUORA94p6iCYzYLEgasunAzQH/JCiUZgiYRSfPoGu+k2QeB6WjnaUHfmBynStXofIuIngRBGjG/ei55ulEArs4EwmCPYiKmMKvgAiEfBRAaXHDsHsfzayACSOJjh9Np4t+45eqmj4Huburvht/7zF6J3zBT0vOXGUfj79diX9LLzSAvv5M3FZobQMjzZup+eOk7/C1npD1uhEAU2XgFIAUo4FnRu2QbAXw9QTm8VosQPmQA/KG3aDG+jPbADEur4qJ7rr1gyazdLjR5D3j3vQtYz0gNcWEtcmBtI40d1Fl8qbLWMBhKbNwtPldYPsLfnjF1jv3Mx8DxBt+ehcv41miQ86Y3+mDJRXgg8GUXFgN3gS7V+1jPQAkh1IliCt7KeDgCjSNEjyvq31Ohwnj2cugMRiydp2CyW/H4ulweV1IMuCNJLj8+7dpce69YCQa0Y8t1fs2RkvWp5//iUCNfNpqisn9UHAH0uDhXaa70mKtF9qQmHLhRiAIgcebdlBj0nNQMpklpa2OoBUepGJVeDDYeT4OuJjJtf7XDPAh4JvlbLhCVUQrVZYva2A+P9Pj/4x4yDm5sLS/h+tFFla2gCwDDKVsioCyPINkeDSufNFiWyJ6WZXyMeLYr3t9OUmOQ9TtCGSqMR4UFLjl6bkZvDN+ynfFtf6tTlGAD6n0yW7EZqok3kJeDyeeYB0WMt3BxVC8EkSt6a6uvq8QnkqxgyARbkeZA0AepilVI7xJZGIcV/ibAoaAAAAAElFTkSuQmCC", + "created": 1721376622438, + "lastRetrieved": 
1726708752969 + } + } +} \ No newline at end of file diff --git a/examples/notebooks/intro/images/data-prep-kit-3-workflow.png b/examples/notebooks/intro/images/data-prep-kit-3-workflow.png new file mode 100644 index 0000000000000000000000000000000000000000..851adbfebc0511560625bf5e7afd35dfc9ef9d1c GIT binary patch literal 101303 zcmd43WmuH`w>AujFiJ>E=g=iccf-&rEeZojmwTS z=gAVSYeBL0v+w5*a!ysiXd=wynYWX& z-d^DQ_291d_rroQI1~EruCELVv%Y{nh4{m*-gXwEv;Ge+s6#?79j)v#op@FC*vnxt3P zcCneZq@<9AhiVvZtCML^^r39$o6quAxgJdu>l=mp>+?f}?sP>0ORZrHc#M-je^ao9dgwL`_gGBUvkM`=4QVD^1O-bCS#+ zk6_GZqaTFB!MRE=BOInjy*a)RJ28A!FMk&jQrMBHV|QOyyk`p%C18Si$pzF@*+qL@ zu+dLo3(5p#qJwPMi`?k^l`OCOgv>juPsT<(R&C{4NhP*?`Lflc;`oZvP64vtcxdwKB`Nx zv&k!y5kf%F;WYZxT8jb6WnVJf&pej46s|i8J0tvL)7sIQL=atLXuZw=qMw}KuI``M zS7!&!G%A>P7@N`)&SILpMU3_-tP*Vfg^)51VBnmq%YHLKt*L>YrdpDsVCAtGMexk4 zlxfqw(bUWq3%!l6x06ldJ$?C1)C9M93Y4SR4Zpr%kB%dP{3?zsP|5g!hx@@|Oo6nq z-?j{H6(?vvXZ=r9Q2zl*q^ulB`a1>U=!W4kGT-;$(;pPXe>63sc#`A#ixk!Qt+&SW z_S^AB^`tm0+#g8`Vhj`&Q4t2|#h_qL6sCz8Vx{Kafk$>wdUhofgu)gkuAp<= z^auJ0T8Cjo?>~nrp^ob>)7|@oHu~N2GsbMA9l4F@-rhts!iMUEi%cbto754^O$KSb z)a&(h2N_c{`ukw~sGHEzU#%pVqHl1ENdwN_clT6V#Yf1ZaS<~2y{2{jd!*Sg8G)au z9GMU>;TFn{a#g0Qv^ks#j{@vi|0PG_?cJHy_OFt7@|iqY32UWvK`lIvCK>+UD1k*p znvdye`#kr*bF?|=_fo)f_FNlXtA#onJsN4Y1OzUfayp*p`S>TM_q;-LDGKt`X5-+2 z+m46VG6q2$#!6=s^iT*atY^oZW!3(z-aL!1}M!S9B$=__1|?uA`w)u=vkJds3f@9GS)AnnlWSdNr#dwXSgdA<|5}UPcj5o77XmtX>o8tU;R~g z=XJ5u1YiHz$3l?yHhUJay)9q&{2{N#TD%Qde=0hQ*(AVlA zmm4ai;-1s(H-iXSiT2mX7x|l0Aad*YgN&4bcQ70me!JkJSoGBA3=x?ixTu!*^`TzC58#FAGAA|$o) zqljrBX5U$=4+g0JaMhs%tS%T`S?=krEZK8e!6!v`m}0M0_|*7H14}d)Gc(!0EIWlk zw>DDwM4fK-ouAHrx~sDo0q_5mr>}Iej4s>ZO!D+i1Q5d;dKu4uVTF1v?pLV6E5=cx zPMwZ>mJ(t2!QL;~2ja!d%y2&I1dTi#R)`92u?lFt%7|Gw3_Edk&vs(}&6qf+Yplq) zB?vPWxC-aB$^!J)dDFXO{GdEJ+&BcY7TD?N9sL9fueyP%4Bc-9l|+gl^r0mzGG+0! 
z1=nPr=6lcBYt+=Fo9kUTw?&lv_Z#OQU-fIDE8~UEreWA!t?zbnB%Oa+eYw)qzH+%3 z!NB8aRbOMgjx9=3fBG15^m+e;+@v#VionYXLq^E-TQY6%^NWK-v+FZlUq4>6&;lNt zO?MxXY##<$K|#f$tLOqE90+LL=eC+=yr5ApkBkSlnn7-Lw^OFM^+Vd{Ak=9;jhD3; zYVc!{31$_~sI&^B0e;Q<>7NMRr^w%v*lQ}(sF+9lnY_e!DEO2~J&EIoq38EIW#g8T zIG87UCt;NE(f#=$ddm!&7sFZXS_iGKU!~jH%`09btmIOQ1{4BceKLRw0_AD~L+T%p zSiAvxVXA^pupGQIgv+6WcMaTA95GtgUat`D=`K}CQ)v|UfnJoe6e z(n2JbEVBnjJA(>4LtZN0MlcR?97Jp64=-owU5DeY{=IXDP;&Rgcee>K&@sx7c3i%l zO-yB`Qciit?XmQ*JoDxCLA#-6hVy4#^Crv9y77{YJ*K7Q7dofKt!JW(&mvFg!*}w!;op5s0&ymqy zmLE;?3dC=}uD6~ik$?SI(!*o7<3csqkJ!q>_?&L3KE(tWMC8EJbVMfS!=-(M)wqoF zve#ITj;)w)>*V3mV*V&vymHIk0OQ6YA?tcBnLR9>vr*#h^h0*yL&?C^CgM6pSO0g{8nR<}e@1~Dd;5YX-9ZBZG@sL?KeRWp_M?%XoZdUOCy$5jdL9k!AaF1btD7x6c$B zaJgLO)X>nV`5p8E+Vgs>Un>Y*X-rG)+>dy!K5=BRCH=>Xoy~3W3YAF5U$M_y9M+pfp zeN9c;=~xQZ?1!VE*1tcbxj*b^L&9T|$BI-i6hxDhF;L@hU}%q(GaOf8ciH5m<6ejw zYG55p37>3f=xMNd340r?7!B|6Z7UGHD$osT790P?ADVpo z)}IbMF6|F*$1O*-xkntpbx3MB!1LfIoDu)~JZN-CsYLUoiC`e?c?337{4YLUwqV$5_H;08Ir8;i^n?fffgoyW-v9+os>`H_)|uyOY5`Kb@s+F3_4)`Ip&7Xr{A1N#)&vlI*{C!mM_u zmBzKyrh0OJiZP?;(87K-RIOLNWw&d;R?J&nD-&t2c@mx~_ruArA7QPLUr-2{>yHDH zE$_9bvx*G0%P+Qp6ka6*>sp#~|F3A$Gfdp-C(g6svy z`zv^2GNX;}jmqO6=+UqND&TO_p_t%LHF|RUEZOaC#iFSm46U1WqJXYtKYrxi!mmHP zO&WdM2rMm?7M#8(vP|UV7p_d{D&4&vJ0tCsfLTrD$~jGz(3S8nmqXL0=_(l~c~BKk zPR*x-UMMqo&s(w{j=#eNEn}9U2CnbQfKNGZ zy>_*^D=uQ%E{y(ph*zM5IICnMJXnfVkNh_Od#S=dVgNLDHoiRYEt?qM8sk3);|8&F zsLE^O79WH5u2wN-1)<>y0^|obq=fVlJq&x(V>WCkl0`a^q~<1Q9SK00tar7kx~X7f zLg9THkhj>faTv+Thw3FuwA@rnlf}xNP?-uoYuv324>muwYp@DE3@2~uw?P1%8NbDx z=|BEKodYU}C1)0lB!Z0FH9&&aVouBU9DJ{2?TPtS9ZzlbI@%H9M-qa^;dJXM^>!P@ z(%5nH49c)Nzk6o!(R9WL`_(U>3JX?Y4E@ zN32``01>0IrMmqXPQ{G~FV&!RoFCg<#LvxRjsg z9Ks6Eea=d(sX45@&cF;zmjg8b$7a3ey>qkrQKrjOHq~rR1Hk90F3D+*e_5PS zkVFxo_jw}t1u>4(H`qw+yKXd~@3vn@bgqOWJp$A3NY73eDl zdul3B+uYwh$9&H$2+&Li4aB!Xe5KV8l92}k}n@tZ(33D2H-mlD#+dT5&rLn>87On(yweKaEp96_J$Lhy^%V-W7J-p0)&$Y< z1SH)>6_|dgpD5Qk48T`Fu|E!@WO3S`g8(6^Cql*+kkn!tazqkow`ibw!K~d)Wo-9e 
zMS}VR)WGRY{T@xsz?Mn4+0Yy2r8YzGx5%xOzG!-d7Ln8>E!LZJDs9!Z0YpHGMN<~3=Clkwth|BzIl#C>JF2A=wgq1l2@1ky`P zvx}miG=;-uOic^IU41_y1l$gIIw}Om$A{&=$57bcND30Bp1O}_mrSHe)PfDhi#13& zx!5X`=BC1PU{+p49gMSDMM^(MGl2tO(lj!<#(x*7Y70IFTCwx0HiI2EA&zf_HckMO zg!2KlrnVN$>uRGA&NFwvujbc3bm+}N5i6w3vHL8Sh^A#P0PXFN7HX*kwck%#NVbrU zW?xJ+Q@Qh-3iz32whphNOmrF_-vTa^%i+!s$yID&HgCc*C#Tg9OAC$eQU%NJj$FaWi_S)qf5&L3)uN4IPdR6c zkvMp(2=g57cqMeV@ZhKXQ!FNSJs{uQ%2TtlPQa{kx%&|wRo>v^ zW7mzwd~97QcI|IAqPSuHW{W})zja*5-t6n!L?+cgML=S?62GM$sI}Y;7Fw+o0(_G0 zw`v6ZM`5Ig*r(llE90-Wewhz$m$Rkn8Gfg;ltYKN;`gw?k#zcpn?sdq#e$H-sgKjf z?gQx#?aawW&JoI4B7z^VuyjNvcm0XVfyl!RlK=?z+e>lSVt1(0K$a*CG75La!L#p? zGO`fue7W(XA~JA&8;hD0O8c8U&|iNE*ElO4ooBpMXv%6OI-qrt^8{%7`J0~N32w1# zh39;Bn$wjAu8G!`(h>d|dFA?aO^2;!le95*^F7UWO*Zo7T74aZF&NrZ>P6tQ9dpxy zV|{iJ+&@3_!^#aCR5==dqj+3ze+l~4uRev3!2Zzr=qgSi)F|a3iTR(qc+A%}O8pW{ zbQ8t6QWfu|tsHk%G`EhrrGKGSC;(?HBA+8Pz(YSNLd7UrlSB)Uy;l57A561?TUx%* z2^(r2woOC#O>tFzIPn765X6o`*yJSm(7^&PD?MC14?pn{*ZCw8V8 zUxATa={S6_U92TPE*JHx5s#DJV>BbYOAp{70Uk{v2lvxf$4;k4z8oZ5%m?}-NAiZy zZzbgA;0wS-?trrSzB}}3^Y!zgd4yH`$Njk}Z@<0j?}K@=k%5}-#XHXw*XkZIYq%aN zid@k?34s%eCfJNn*7zd-n~ko-r=Q_fZ~Il z-{GQ-QNJlZ{z!+}W#U#w+fa|fT$2{L*M-7nxxQRtpjWcR(PWuEU^;7I6o2z3Gu2zCFEf2FzFbf9j1Y?F@F1`M#IoSF@Ot}K6;PXozgYIQk=>yWn_w; zmLf!!r8dO{5vSI#Ex!ymTPuR0XyB$2g$!BNTEGR4+#;-h(iYf_|JTErqIQKS=ZBC= zt5j(}J|6WHbtSoBy*3Fe^NgNBG6#sqNO(=y9osF@p#8rp*!$RTFHO8@rcM0X2nY!W z3Lx?8<4T;1RU>7(RY^k}Ey~yezmfFYy_;m@<-^KNINQ$Pm=r>jlk&>7B(E({3l9Kq zPUy4Aee~w)M{iaVzT#n?YZESz(3UAml>>rJc?fFt%F!R7T1dy;tjvKce^UlH(oQ6( zPn2on9#JB1=_y>)mZ4PS2`G-%34Ryu(y8q_*W(KG2W#NN;!?!$2%U@rrB$H$gBkw@>a zee@1H@#;tdON$VKXn@Z32`0tAKyzQ<_^Cn~hOje81rB5{R-!Qu@ZoKp(tj90CX~2v zB(Cu+uLD8Bm0RS4$CMR8FBz zHu3dP-+2LKB&$}U$GhQ1N{lc};1tWsD=8~}=4o{Cm*KZ1u6|MofLd!5lof>{=G-^< zm2CS#|6mLC79479169?S;S52oQSk@whRlcUFWJgbh+h?kL!Sv*1F>S*U#f=H8x;iI3OY8E(%vR2exB&=b~ACwsQ; zi_EHZ^0Bk4G=JQt1=49|hU+i=M50sR;-Lg)E$CFKjwX<5D1_a3tiP#d3V9f|QC2-Z zioZlOJtix}HE_$F>jkGhTD42|<7 z?!Ktd+ZUSuAoIsiY{n5d`yu^Qv&7?cvo9pRgN)|`wb$hVA*ghJz9D@t6{ZJRw1Dkx 
z>kOC~gdlwRa%gY1b_1w(yc}N;VFFggKl-A{7;7y?xJg7`vgv&_nd)(AIsavd6|rdz zrJ%nl<~7;ryo=X!)-~~|6Z!i)$iAf=I)G<2RhmfCUcTt}aL14uROWiPWXzzJ#rEvk zvl}1z2f0{EyS|=T>X$T|M)=aGp5(N-B3%**A>n^_tHxzyZjPJ__S37r_bAv7B^544 zMDyymP@nF?|Iq>zs#3&%>c3U$7MJN#pHAcZh0To>VxSz=L+%aZXW=sQ)z*gi!&R(uhCuZKm-zkPd?SyIL;xT%LLd-sXg`kCY)#>C zwz!Gd!>voc>CH+QRn$#s)p^d0G`ky6lG5E8=YV^!&zQv~T8yM6GGDgeo+xcEwz$vc zf^ENvy6w-^0XP$|s>|juv+m@>-SR_XoGUp-6J)mdGa_e$f!g4AxBr)9e7%1h=@eIy~m`MeOCSpM_>X?dP) zwoaz*?B(g^NPHZfJZTpO%J!U16)($cg|8C;m(ZAEVPP@Y@o?vm|F-sxo*vUyhD%Ju z#;;sy^a*qjJLdQx-1iKFFopA6;+tfGxus=#4?0-Fk=Mwis(FVcA~rU8+4p9|JKAuQ zwW?LqjVX90Gq&s2GFXiQkP1xv+cEuld%d9xAVL@RDa)Prc1cY?u|9H8o0Iq_P=OetzL77ck#$ zZy>EpO9N6zFgCzUNrlz@DgYvt7autRmu*koc{o#8S6!ueiX3&x>rix!CDe;Nm9)E0 zJ#9}bt0&vLCVZmkW9mE+htYg}Vw2O)3))HM;x~VdHEAEgZh9}rIz&T5|E1YpV2j7; zgO5o^dk+Jn(q3*@(yx;A6-pC494Q>>M&mq0^XaJYQ19#0#vDa#aMH}O-@|@VdCl>- z0EHB0VDyX?anIaLwb_WJw%JyudltF(QD=?o?i+lUDDGH+Rgm<%_wAF?^;!ySd;1N` zNoz5}_8nPz?xSzpJfA@@U(KBiadxDPeaZ4HrD3ZOz5lx=XD8n(bfB}>N9K^ACU({1 zEoUX~m&Rjdx?q#-du{btTff6zhB8Oh7Nq||L1_w9uGLE_SfT- zi1#IO0y%W6+c8AwPfn1vy`|0jv4iGa4v*6jew8Cv)7LsJ=H`&T;Afg&ecSYT45iOI z$>%b6ldeWtG%s*1l9*f$)Kc@H%i?xdm5K>9Ld=mKEfWi(D}flO&T5Oz7af|mp+oO* z1%=%Y5=JwHlX2xok7pcL)=A|mo>n8gF74MWp8f23pe_OJsj>09mkRas8!--O@7Rbk zt=s;7C{#)w1Oomob^#U{A8{}Fey4eoSM12=i~afZb>x{G$)Kp465g9zS&w?(JI}SQ zh%W51VotC&^}O{&kxJQ}8@9}o695XFt#s30RknZ=m!(6A$5|Glwji6Uy>f&vCz|UB z2#FNO|NLC}rEI4bOs6h9z++n6+ozbRnpw~pJuL5hm=1(a_J^PzoccZ+>OOveY%Gn?t55D32ljCgZ)j@EBiUbK#xKYDqRy&1g-+ZH_=ei{dtEDUKz4= zyyF4trT(-CI0haTVVX*@M7~c2x+&&UeE=s%Sjrie#}x#iK6}K{SJQS4iwwn+Y)+Q3 z!dL6@(cB|qWPJ7uOqN+5MOUPks&LkvH*}p{E@PsWrZWM8GKPwZp9P8FF+DiEy?j)D ztu#NVOIPRWFfe2dWM@vi6}(O$`58mTEpJv;MKnm!SKbygo0Z)iQ3e<&6_FI1nt~4s z@ab2sZDjNETIJQo9*4nIkIXG=#B3xne~6DLh}5(fMK?MaiD(&|%kKTOj7s zrRjqr=GMvPx~q8NXQpjc-!Wdc*?w~n51{Pzd6t36&E48cgOkYo1KP9nJz$0Kj5-91WjsI;V zN3%lv!~M;jr_Uhsmz0rbW&N}u@y@@`7GgE`KeJZ88q?6u6)`O1Hrb(3C0BgC6DZ-r zYqU=DQ=?e27}u;4XrmY`NI$_)11_3LcLe*T`z0c2(m4*>Z|wDo4a>U5vk$RSFRr$L 
z*_vEpr~xk?<|^mPf(yO-h9YEdT5Yk5?)?wif@oZoA`)JGBsi!C%$vBrF+>aPN2Gq1 zSbyZd)?+g3jc{LqpyoQns0(N`TVHGZ*XQd!o@S`nj6oPie0G1m%{<%%W&*`b@c4L@ zs4OBVZDd58-F*E3&}i6kkLtH<7o|s)<~9#|XhhuLVhBAR+YUvzCedxcD^uQF3(5Bv zKpXOmD_(>vJVearIz?JdXdCQtKBX%v$jGlCSbcFC}J33`Jjra9TH2!z<5^u?vp(YJHLn%H$zn8n)nSpKnl02nW~qC3v6 z3*V2)Va&=o&7+P-2-jWzvTs|i1m}ul z2G(D-TLz*ZJu-7O=8^j3wM-25B$9(|Tnd=l>}(4pR0`F7MpZh*J@Y?urAIV+RJ++L zn%R8%ail-vlQC;XZE0QKUaj_z66t)=5w(Uj;8-(Ns28scQnK6teNGfDCLD`4+LBF% z5c;dqxCLN26GIoepYpzbtwqH7*4F6Yj*Cy_F92`NLwX$#rHXRPyJd-3IXL(tp0Ru` zQHbJMop_JHg~wPzOilAVvb+I!CKn(ds=gq<$b5CXt;;k3{_Tw32_P!7hd-x5@y*zz?4Z zDOdU%MQXjTS!rwYUku26lQ0y^Dn(sQRs4!c@Njr^V@I2t9V(qeD2sy7lX~c!0;PORUY-+ zw?G^^bsv@IwiM4Xoid?Z81FB3t2zcv{O-(&1}j5bUd}$Km*J`n7)qA}VURebg4z$r zZ(H+f94Wj2T4;E>VZk|4V?sL_7rLc)8)9HB2B$%m z*C3YNz-^E1c8X^!A=aZ4pX}?2R%>w8*fGxC=;cg9Bk&ooe+f{HvCP#eac>RBwT_9! zuP(OG7wNw)nXW72IaoHyTePLtd4q-OWj&S!=XNXiK)*3-;eJie%CBpmE}-!^MP)=d zaHda;uYVDLNLX&L=atPjNEdO;DcBpr*;DfwR#U$CIa@n5GC@UszTo_*Q-G!|;eD3q z$8nS|GT%CgtMEI7-v)NU>gf3R$pyqB&O*m#+rjRRl=XhtT|5`3$d*aIwzM9DU(u`N{12qn`lO>C; zc31T&cJN)#%PU#?8azFc^xX&RY^F)4KP)f{fBZxJa6R2IeAlm5oTbEP$mkj`(n&_Q zjWVDtc6%g!b$7Wu!hm`F+tFNfA2BiP;5X&Ega|Xcw4Vb5I$RcDn5hX&9ScTj!W32< zIwIDpI>41?@>ak=R2T9W1yVHIo-xx<^>KF+LV38{QSRk;f7hVwxk2c$1xQN@^q1nPa7C(&d8JMynAi6H{-7i(+4$`hUewytE?ig?gcw@p*G1bCQZI~vM@Wx@#}9U zn(ge9eI|v*o9Svv>qQ|$jiKotnm&cFnuuEE5EWfygGRW^*fu(H5iNiXbE26B9iO(G zjVl%UJ=`b4)0xJI)e_7f`?_aR*ze`CH8Ncf7G~KzHDXtit|qnALJ&h|L#beY2{Sgw zr-ro_(N6ef96`#F(DMkEiPGJ{h&@ja`@H+!jl}ZIID6s4cEgh-_V~VrWqZEND8iRY z1|7bS8gN90xXEgaVBea(0XI7=-6rg`|+xwp}NvB0Ur_nshsiv`8}KTe7L=#&5)hN|1he)eAm6=b<*3ApWn$Yj%6uoHU~ zy4nEd9+LI?`(eRTfDlO}b7<87n(`s;OVzFBXIarV#&_4}cYTd{D8H&q92V}{uXk8o zKr9iEc!pFV4cm9nGX&}^(9=gL(bxy#@Ek-%xi!)VHjvTx0_+v1`0Y`*@0SGt&u}~b z{#Nslze6h^1}0F%a}TXSZGeDS9c;J93oVd$Im(t}D1?VU;g}6f^?NtH>qI$q0Ldu) zK&SHl0X+=T>nVI*h!G4PAPowFJr!LgBpF%Z#;6kP{-}+tEj>gTbiBOpO)pt&e*7b1*k}i+Gug=uSk}f(*H`zZ* zJ4jwzD%NtDfpg_w#I}6M#wwnCf~)3K)lCxy-1VdAfq?#3&VvjaGf!*^7IC`;rVg=( 
z)}HB#T_?juiVvnNL<9{;8Wszp9NMqsk~mE`?B)<$`sxES(`{bvF1L^D z9UwCf-3y%Q-n>mlqURt%z7_`>6leWAkcjU9eR4{QB2y!1PlYPnlHc#%+eOT2Up)&K zBbv<%@qV&c!x>=UlOlVy0(?O{c zF^JC}8VKoY25QAAiF1>o2y)atmneY`jo#BTYj>kw?Y)yFnumx9kpt750Yq1u*T&IO z+XlgETny<(Jw-Am6~}U=3Fj^m8x9=*uUU`wuyL&%e!|};d0B>!{0AU`G?c;xk)WiF zy#7B?W&$cW-=@2=nf@$3NWk#Tub^m<3z%xkQ+R)cc=-Y1E9V|bbch+=bJQQ0xfi#m z@P{EzNh}V`fDl$nu-47RK9&ZdLI~;e+S*YeB?W&8qJE%*o$Qg5PSk0zHM~HVG@u`YQz)?$DEY!B+zI z*;GK)$?!U8l0)9IX1bT_S}!fDWYL-s1vt~b z%wK6E2uK(DoHeARt4nI4513PY#|g2__NfOb74*Pe)_ppVfFdvVN4$p^GxZd8D*N(L zG%oqC4*;j4y5xJa?{Az4^ufHJr4&Ss__@M|AE~nwyw$@;>dZ$1Sz4K%zy~KB`a-sC zGcGZqKMBZVU~v-zK;3oxCBsye?pgnG@>>>qugk#zKmn^UIv`%kBmhioA=v^y8$Hmv zq$&Ut1Jda~Q%c57C|sZFzGAKtt#XXr_19JxkyzImzYuvxP=*dHH~f6t7;nfOee@#W zlNg{6*iY0QJl>xhNaq+!zwMIp*b8}ysyh|~NURS z0PTf|JhDmhPlTf2df%qWQ^s-n-a2)-Nq`Z;M-tmkgUrrTvzAIjMJjjJWumDNZVvL! z(a8`pdXhiXZ>a+*z4og85!q7ZLn{I3lpiLQitEZH!(DfOjlkGvfELx#H2pn}0!%Bh z_fE=Oe&H2?W$|}_DvFAT9Db zoUNZVoM%CeCKSM28Dv4k>nV>kN4?$r;8Z|59a`tdpRzAW+*!kA#U?C3@`H3m#4LjF zSHD>xUmfQxu=X7HW$OBxpx6cXuSW&)jV1v8!C3kgrO(;Vt}TgK03b(w#>qPcGCY58 zUTF|hNPqA*7d|r1=m&Z*IOTpqT#XG_&nOsO2~)bOG#;YJ`j0{jz4f8J2Szb1J(MbH z=q!E(rnWcICeb{H7-bC1e1WEg1yzLYhart=P9lomd)S8x;X=NK76`o7a#PFZ5EO~c z%PZLSxrB7HR-Kg!Wif8!IMLH1(@YfK{${O&y@5cS2LO)wQqER6YBv1Fbl>2Lk#)$n zy^Y%wfpD1UE30o7CVuL5-4z0eZvX}LJ3#WNGQiVDMi2m_$(a?AMsuCfM)=&gUG3G5 zMH%XqBFCE^QBHdL-op#6Z%+WdA7|@P zpay3X1CoSm$aN`DQsow(YM7ZN>vLn7z6c!4lpM{0OA&^W0rKXkF?5QL>WTXet~f){ z0yd+enwYM{pKy|+WRO$nC&I((?H8MCwA8zC52L?5;6NM2#>H1Vf68M9VmLP zh5mKn%VPX3Bf^V${xX=L&%H4qFSxsWJUHb;kDLpif*64*^iW39GTQRxr?_=w?oq>_ zD3Lt?(*cC26q_Ky2OOIyDj8~GJ!z+KVOE(63X=uV%mHPG-L1bJoJ}$p_DgJb?YEkF zt&XvS9y`nf#_b~Spau$4jDP@tt}5Hfo&H%s8trcklwt*s(H4gM*@)2por z#sUFhL(>-`>TtQJqX8^ZZuZjwM*4?ole;WBJ)^nv19kwLG>2>#(8a(8+?TAlJ<+-n zfOhq>fm~CQ!ZSl(yvk5re7F~S^AD{96Rhu!;y zAXX>B8Q?ioRrx;ifXLiGGY-DfP&h$a(5A7Y?xC(uL2>qzV$d;%y@9E~!h2L*_y%h$ zvqul$Ny@d~SG%^o?dR+5qLrAeQ4`)X4O7pAvxnwk5|S>jRYF9#R)66}V$#OlXLxL; znveMFnIT8vKEW~_CpE_;<>_bI$(XY7J}jSciNQsu3BQhw$_3bnvgecy%WFxW^I)Is 
zu}FDd4N5t>$E4O_$r6eBAzw*YeyJ!?5DNpR5)-IyCZSuZa7_ ziJ@~`{0r3j99&ZS{$K`he?^<;!2W}eLoX^dNwCxD?H@XbQ-q;@J}1(?kvgurc^b<^ zpl5*9%4h%AkLs@W-A=TZRN}s^&w46|?tx`4v++3rBx^2D;r;Vz^DZUM0d#yuMf8{a z(U~M9&Kt!VHBR$<9$pIqf)_ue?@*X5r&tEOAAGVI276|^C|>a5gVoI608_?CPR&UN zT(4U7wrzL?^^|dx#6As2d31+PO-OF|6@gBpxU;XDTRroOIwEu28w`|ZPf4r2ZTnP| z3|6`zwp`)Nn1%R*wcn}y7J8I~foWf~7Yhfhct|~33M#pp8|>o7>S}jbR@H93)7|e2 z!Q^w4*nfH)tVD3Wk_O^H^rk8YL4y=(pAjeoN*LO@X(l)qcPf1IFU25yX};>b@hk>M z<8hLTu$>);*)w&OYoHAZStcvp04Q=UG%b4OBkV2A11G6Rua;|8ghYJWtlWl8hWdD- z7~Pz@hc26jkp_97RPn({M>atIhl^u@@0RbU@z6HETbc(zV0~rtOW~tbAmY;wbA<4P zo|-DVXi<&?cVA3m;Q8(oU`timXFNB^R_t`xK)&eswyTh|!Lri0Z05}Sx6OSVYz=4g zH#9qrw{03s3u!rjga5K(W1&fN>vz!I4)@@BV61o&Il;(l%WIt^(W}Y(JNoku`{T&I zSbi5IQ?{c0CH8FJ%gojq+tk~y?}IX2Z;wxgf8_e&Tf?lH#P0~c%E%%kOJGEN_%A%n zp2zvl&KpVtP=U~BU1|fH0R+$yeC=ygLPbe-hWm6lolN{K zjLsbgy6uht){r+C7fOhV$-**Pdl23Zdj*qVsx-nvK^uSml(gRLJGS)oPmW0o9I2ka zl21e85dMwOo~^E@_`=%-@tDV9a4fnv`iB63920-w@srleP3yYDYf$#hhu0j;;wTp* z`hB;p!SDSSf_Tthnhxfo)3<9C)BQxL1ZYL-pFvd|J3X>=ot;V# z^zjkN7G@1i7f+}4!sROIhFNfFG)d^x%7HbIsO<^*T|lY#Ig32R`CJ15A(n&pfqqErEX8H-Bpc^-*id!}6+X7@f)6pbmMuLeG8UtTd^SSP5XPesPs?y%S+;|Ur2guy`>|;mK1fMz~ z0Maa;LMpv7{RJMbVX2Mb8>qUFZ?lpCxbz_-op`Iw*CXgRkAaK1LI*218w!6+^- zV0?$Qd5~ZWGbq0MHUfWB-wG)0NOV711}Q*a`!BbX7Lb3Mu`R+$X-v(d>@p_w-e%R4 z5HJXu2_c&OM+*R>4<#SuQ!{d;85>8#P5`J8OpjmdSzZP0@fhX0`!DcPiCE4f*8HD< z>nz6%@|D+`8C7XHmfxBOobTj;4uhJq)!nm9LehkId8e2hX>MtDJ%~Q#d14(d>5ezU z@ve3!wN!qM@(m@YFW-{_ixI3<;bu1%{`Ay;UfXAA`3bQSj}u0F}PhZb~Hi?&OfKkSXAl z7on@CRZ);5albPz`?y4S4T!|+)6bhf#$9h^?Q)?8&`x~fhouy#+LmLc^A0hMQ4dlN z8`=G5Z{}(&)pAfqe2ZSO$U|h(Gw1QYFAzbwqn|P3^DqR&W#IN%Oa-!;CX3)jXWB8$ zvpv@%N1e09S<_Mz)d9x;#D7+OII5cGb!C;MU;h=hK(!W0ZK^ z%W7_Q{Z^?-QGk_TUs4h6rKY9^mJu8WKhC{t|CT?KL$ZV8d|InSjkp-$Urza}1Dja+ z5JjbrhL4Abdh#NNVD4Ax?15ju+gG?aYr-3QkvbXTL5FdfA?Vw5JP9b5cuscL9nGo(pa@p z6MAD~E?3S!>DpNGi_fgH^}O}80L-bTx#S^ldeL-z9VFfN=jWpk1KITyGw~YSQLHeI zh6(HOy$j5mnx=|+;>vNb*l~A}u8|_eH!Uaq3}ZV6ejfFDv=|RXyLPy3hFEfTw~Wj^ 
zkuanadA_nq?z!JpVOJ9PVxKkIp=D$(Y0h@}aa~6cf}7g@05pa*g6t})qSqf@GkyY; zRp$Lbf2$(u`M71o(6V)C*f2N`DDW_n*tm9f>G2u*-}_pOj(&RZkzTbbGKmc{dQpx7 z5E_9BMb%oL*HadaiN5zdiO}xu4Wi(jCT5JAq*`zSohdQAUeHFdE53xy)M~{ph#8dw zCPX+l?WOW?aEJ2$-0na^CC$++Cnq+3f-cugsX$mlKMGL z6;nnwe=)55=B2$WFSD2mK)sMSw9&7beCpgUBqzhCivqG86gxn^!?0Er&8k1^^YydN+}l6$9$4lRsH{Xdgti4 z-Y@LCNyElyoJJGdw$<2nV>LD#V`AI3(KKdb+cq0K=lgq~=lv^dWzDR0&pG#;+4tVp z^|=JAbTeeXAQT?e(1Sm&Sw_5e>EprZWo|}{mhO_zgxBCTzU7W<1!~QQv^~aWMVW^S zdRofsfPbKB@DeI?4t>gaU95VT+D~IL%QmZ!9?j%S4e9WayCwo6m-sNsS({|KL_Mgv zXfd)@Mmx^oj_^nHA7tu?d(?J9rwdgGKd%K+G}h0Zyr0YV1i2RuA2b%j>ppl?c77qE zxZl$;gdGRbewE-OTN@xY3de;y7CS>(OLxHD{fM^#P6n)zh8I1~liZSUTQN6?E!Gru z)sDGG6-YCLM}^Rh0ZUS*JfhTxvGD`7dL6FCVFQuai&WxK;c3ZU&cV}qp;`dlY>Srj) z195S0n7~sG`whOzOMn`}|FC2-NFRwKd1ie5rypS_aO)4F4JiDJzO;Zf^8{Q`Pmm z5vPmTlFH8E8x4~Aw@w4ol_809WetAZ*<66oC7QNmg5k#)=!m1brr)0=<69~! zS88SOS_Kr-P}6`;L)twTpa-~EW&G66xsPTt(w^vYwlpobl-9eO9BNGvunbHHL@kfO zX`>lD<)@mW#q9PB(*Lq8-kW#l&Bpe!OLCIGm1L480DFglVA;dtW_B>zoMoaAk2AyK zqC??*za&Hd1!&LQUUW1Wlbm0+9zC;v4#?N**aM7a$vd5VkAx8s;5}Sc&_*OGA@`I3 zZ5m(z!%kDzJ4sjps=}+_wkI9?F0=Z}1sId{YhY55G_k$Rw9$0xcPngx=>Bn&J%~TR zuFgWDxTJfy9(2=Gu4&x_p`-BfW27+=P?r_!6s7@gH4fwL&5gfcOBKqbc*_uj8gRbt zDu|*jWR8E|G`^IIdbvT;^{!QMrz1-XP>#Rdb2`4ZDH$tk%dVTmG7m{B#bZ|g6QA)i zpd=3onb{km1bpIs^VWqwtgX2B5d3s$+opOnw|_vT%l(0d7-34Qj?-sG|# zG4kgg-Mdl+px8-t0LpYxM+_T;7FQ3E;;%!C37rMJG8f?`5n=c&RfS?Fg>UqA#~r2$ zzY56h#w0ACo+;52GU&-&cK`bq=rD1X+!fe7qc<^UROlGHSSRf|oU(uRbEIqr1(T4) zYlqor$hYW)+tRkSUC*rg*LT~i;5#Pzse+mI^!cm|C?wZ(N)&SJAb%xjA&_o>SCizU z;ALX0v2mY?mpj=Z(+gzCbx0HJF$+yi@A`>H4yJM#EKNgRG-0u1FyfD8O z8XN@VGm#(N@$S$+@#5+66U7{^lezf04EyTqU>tl_gHb!sYKQ(x2e4ZqEP7iYG)We*4+1B9@{ZJ9~{e0WJz+e87$)H zuqX@9!l?!p@pGM4WV5cDJ~?}SH)UdgMtk;HrO+-V0m<%T&GlD*qn(-rV2Md5TW&az zQvDe<$DJR<=6mB9E)1cj3hE%jA@SPq?jT4@X8zfF_Iy=vW3uG{2cb<{FO*=}x#iAv z3n`Q)Bs4BXI_n@v7L?3uGA-+`l86H$g@eUryE+)sTK^m%jjt5ahPQ`RnC!E>Q@$^I zr6l#qShbZk(ee!Eo8Rko@wdehAtoa9?FELZ;YRn7(t&D3_d}H; zE{5-*=9uS+XQdEm3P4&nxJiB;G=QGb)-huQN0Byo-)1(3RWR-p?FhO3S$Kk30Z`vK 
zN+Tx4PuuSfMu&m|pGZjpdj_GB!hOXZ2OO-)3zB(d0i6rb+)njM6NBmxDPtf*U`p#- z!<|B{56?r)&1&QY)BiPHdF=UMz~;a5&vZSjyXgxzFMS z_gshWc|YX=GH|*&-$aa&-4a_3s8?~~kK3u}ialYVrJv#ZnuAwV zb3-9Hi#5KXC|nbfGj}VSo4V(LtA~Q4C!PZ`yK>viM z1<-()J?cFs0yKcm4l#nB5B*T(?e?6at%fODap2OVJDV*GgztUkdkr!|mF|>MlCMQI zCxOM47?d%XGo}(-THc0^e)%i{ry-0(~SYB0woB$8=ripuqKVR3JY%ouI%#p4!zU)HJs0Nz1c=!BPd{(T>1l$C8XRrMCL3 z_&~5cT{>}6Y!@F&E0A!srW-yv7QgcX5B9i#u5S^i4LA5`P)hy!ZBD(u-6f?o5YqDL z9$Wcxh<)eBif+X2Zqe2iGI+Kvvj1X@1m0R*CW~-2I=$-q3P*T3X~Peuk;c(lGw@hf zAak1u7+X%%50Pr85G3Y57raUCQnafEim{*)O`FNSx>X1nsU zXD3Xn&lw!f{73=%R!8BHEZkpbqDwf)3fYH|qAJCgy}gA=M8rr1tCVpwK5(D7X`znZ zGTWTfk3okzyweU0A8u{LzK+2_GwdW1A}0e=2PxrzP3GC*!LJ%6F@-?)jPm~6T8hPO zw4riM_t+OG`RV!fy1id4xqD;dAdUgF1$T~bxv4gnuGC6sm^kqa_IQSUhrCq~Pf}Yu zr?ayzDD6)HmK%E*n-n$X)`Gi2ULj@mDWTii_JgN>hiet!ZRj00@)+8l56nk|#=Yc` zVmKJF=}uVBxEB%+eT0kf-l=TIyFJ5xAg|AxO0-zz0c8IeEPbm;QVZPUb)_klYn=)6 zfZ>oyyqMaO$YiyWG zc)`ARQYa3W$|{HjAJ}&qJ8X75P%e8cC+!cjtqCYqX^r{ZyNuwec;p@%f-U0|I^{I! z)bB`d4oOGH##o^i>&?a{(%; zb$e1#gD3I_FY*D9^Uu(Ald!DeR#(xT0V6w=g3yL#(k|PSo+YlFOZYVPi>E<((7(op zs8#dV7lXp%35%B!^Cc0W2ql`g&-)Xn2-#Tt%ijU5nmgI@ zu6)&I8YU$x;6Ia_c`Uo>`x`DCD`R`h_uWk+P&8zzB-aj8wWjzTV}KI_O(C1}+s~#^ zQDR1quK}t~f+d_` z>iq)-Z!H!{0o`LJ=mfdb*^*ci4qXGqfL1%rPrH{2#-E1st!ug#p#C0)jE*H?V@KXJ z+sdl>;hb<`xj;QgK9t6wrj z`@%eg0G9Jb)k78NY#l+D*-}`UV!%Mq5^yGP*=0DkYWd}%Vo0XrMhs1^1(R%5UPF$X2)zC|JV{EG8;7pmHklgY-1;tg($@iyx764oDKkP)_{DdBraP#54 z+)I$>WB*SQf_n)}Kr7U7&Hm@vcnTFNq@A^jq{aS1FU6dwR>?`cNW(Hrs>_{b;Y4uj zVa>q@=j_ypUwLh3q2D|B6DVxt53KrBh|#yYhIRe9eSzTt(y_#HX(e~Im`o+M%^FL` z^kgwE4Tr`Q?x~aXdiA-XN{(}?0jkXJ`O;sHrHodtw4R3D6P-F$U!dM!RagLb(%M(r z*O)RKt4e3}$~b`QWF3OwURqZ*aj%A-23V?hYv`4gTQvZ`rqp&sP0^22b?W7gmboNn z0}cfZ7W(aC0WjB@0IRvzhR>$y@z3uz0G?yDW~Z!u$N2v+(L9QvQ^LQ#25 z+v?2q>J-zq`n2Mg+15+tmV}*=^+2I2x^9dPXP`;JZdco%T|mU4$n8@Q z3&muyV;s#ankKPI!kel}tniCKh#oEtf`aZUO_9q&07Xc~TzE#KBFP}xP&B-jJdfT<3u$+tD(A@2^Z8PnzqdnnD5DSX$ z(jjsmuasd-t5NEUq9mXSoRWX?T4YEI?0dfKA?K%+b-KBcTlvQy5;Az-;1~omL&2@Z 
z(`ZSCYFYBlr*X=KJ0}&8)$Ubmkx^fElLGg*@U%MN$tp_inD(J*Bq`F3 z{8BEA?%?%;!{o}>aQBND+oi_>g}Y^Kz@LurXVBm)KKl4_Y6C_?6gS@&tBnlxb(rUV ziH?b2OKI){7{Dk=Q&QOyoE{j$v|0GLThzZLB#}Q~cB+(L;jw`7b@@qVi04$>^l59>a$VY2h8x#KrG#oL zRmGKcO{LgxwfgzfurSE3*QeqK$%qnq|aE^ui)F6kH~i_t`*HLW%@nMT+;q(ns~ zWQi4FAS4Kwu8Y92n+dQ|L$edSrFe<`3_w{t%);G&6L^En*@?!X5CbO=2_+N)q0i z;Y_eB0}Y-l3JI$cVS9SX`4Y%?G1!&G1o^ZJO_Mi%lFMmnWRD zZqlJcDwYeO{u_;DV1C8jY#83gjRaSbg6F`_YmaLm^JWrh4}jMYAfdRh|336gqp|?| z3yrYlFwbFpg+L`nlio2)k?zdUOxcc#BjeXo2ADcP2mC*&>h{{4e`9pp=S$>+H=EM> z>5K9w3P7;spnCM0+w02CvA2BXni3Z&F90T4=Kd0l<&4QOT^S#cIbplTBi(m!_S`qv z=FCFA(pwyV3-b+MGgyjb4FZ2?14qUcbEn$M^hmIORWt&R_RUR$Z01vmu{#PsQV5wd$qS-bfb|s4^cI>SpZwjzOdGInbHE8OQWY)*uq4XSwJI`R?T8+4TP3w6-=Leqj^agX?;y33 z8FPgiuJtoIMgIwfL~vunRLvYi$jn{8aBaz4{e>PHLL)qtwM(>R1-QvfxJYOZL#)*Z z9?aym&Xm=P3LUExgv!tQ^IrYOYhhgDg5yGx+N?ZSTwPiGp}y7k;$Z{<);fH05)7gj zg|0Z?N!z+1QMjG@V+KRs1mcx&^a(srDU$YObyjxcSi<|WlN|9#Blrl1nyWF(^t~hH zOy0^I9V70UwY*Rhz57`ikN#tcYQENie71FvM2tj^mKC$4|9$J-Y2xX4jH{Iwwuy|l z7gd%@i}`{Zj(wP9u`-b(8-GX`tHZpZ$uy1?KBf=-5A@n{>tz`c%aodrRgS%G3ms&o z@KONbwUW2XkkIj<={k4Xmq!XH{y_`Zbz5Dwxym_NUcGtKHRsl`4)FWh$65j(w$iaS zOBE^h&sD*l#af8Iy-h6$oYL!uMgq~||BsG&dQ5HPOu(!(W8kf%{EkngDi5#^yd`@< zq-fOgV#LOUs&Q!o{qkQtU2GKdUX7gzT4ck>51-6A&`>XF;3(lZ zF)ha;(U-WNU0rAi%;lgOkJqj3lnw1RS_R9Tae!t zrI7Z`V~`@L;U$k>wzk5F>tztezyKh+Ssg4Z3R zsH?+KE$8ITCq{?abW4aYHH8|4nnvJ=7u~n>KM7O!xmACmH$C9!$ZNSqN`iI(0}QfJJNY*ypB@<)fgsr?w{2!D0M z+jn8B_{!)-*t9{h&FAcfMfwD5w*+0tcZ}QE!*G|_J%W{? zhOybozKg4RhSO#n)ou=4gtuTgXZ55V4#;B#5!AV!N%Np@9fOv2AbbpOIN|JBz5-fV zUr1_6Zq$-S2<4h~L>75vl_JNf?D=`9oKrKBtmhF*il^59EYD9dz_4^4EMS#bxCIIL z^wC61=hZbB`Z$;tMb|^}@OFJhI?~;L$7x?|n5gNIl$LoAu@aTgReU%3oc>tA&f!@_ z8q>z+?85PDsdQ)}@saY7= zt;8px4yC~kyY7bgiBNL)>x`E%%WBl`-0f(l2Sw&3uW5Z!GiyjV*sp;%`7gxxM9jMpMaU@_@m%#uY6;C6X6 z%YhGvl_O~MvCBh(xl4IcQD7djQ1H|KqV9bzy!hcJ#k9_+#aDN`dlIQYtmtY-T|oh+L_uvuY2lE$;>>mH!@~jy$23+J+)jUq3*B;XuZ@^qew~ZdcB`?==$F5|h~XMJ$%5N? 
z1e^%{{cl+EZdq&$)6QmGeAV#^j~c^qYQ-|Gd1p20{>*FRKgn(cmy{b8{~58;asbNF za_1xSFzx$P2d;_9%AkQWL!=ws;WBrWT9`8;{xBeWY4sSYuQQFtsT?O+x-z!ywReLD zYl`On2uf@SMnX=sHaU|AFK*&4qTX5+N6u#ZPBG+v&m2RgOaw$lx~-xlPZ>8necee>m09Qk5^Z=n zhIJg(N2NeYVcHN|6B9 zPh>red5s}r!@Pl%vg4%Hqu83Xn3qxCWw^n@dRU+U&Vc_Ymy61@s2yy(U^E;cH z_I-g1*NBIlXtQ}i%rn|8W40tdz~WlT;#-p;AzE+i@t9$-zreMRxlfmqwIN+D$)e`g zYnOSo7nGgl#FCM-Yvux{OE2IaN?lx2%B~F}&A5-ODnDLUDbLl$?`1l*n`XHt&&jV+ zgcH$oFwe=H3`$P|FO-;)ExItsD7*%@fG%og2z3WmXPwIxgu`VJlLN2yWl)`f(|2~PVi$}5qxV(GXHA)&Z z!V5(_$`dA_tFBlsqR&GqpQkW4(g3??!y%kGfxesh*=Pcpiy=R5osP+UL%X4z{~gA8 zSXH(2)sW9wshxW$>Do;0v*ULG+?uc+oGrQXfFC+)W;hx#p-AMGIemHMyZ=j%D8+$L zEl(RCdAU!3S#Os<%jP$dFD(-M=;U&8vT9;UT57iu+EohNRv+QR&X0Jca1Jjw1_Nx8 zv0&BOTN{mitXn0H(4|jK51PUOgCEPUCe?wX8SIeGNQ-V9!E62D81gpk2QT9z9*4V@ z5D-TU)%ebq__K5OM3=M^bK-H5A`6Ubq?N~gX zG=J_B1mv*`_<56O4_P$Tlp@WQa{1rCCv=L(%!a9Uw^6-@hZWFVeT%K=V7(bro$a>}ukTjf1)Qi{RC@9$p0NqU%_k2JxKdqQr90Xh)nmSJ^ZVx&LXY9huNV-q zfIKf`NzW8-Q%Z_~*&SmrgE`dvU~2Ojaw=nCW&_9Sn|UV$=((RW1srH!&emAWh&nMK z-FJl)@;ifj zU|({>8vA2%s0?fOHhGayu+SOBTmR;H3}tKKH%iMZ(*z4fd28cV=POKF zjhQzx7E+b!%!^~SFtnz0mJ{Lc002plp33qh{`qV3LaX+as2W|4fw;6{a^)>|rHlCd z;1&0%NmV%ax3NWE8ja@(E_h|n)i?sv0%)2=n=C)chusqsxqpvjc?v{;b6A{U$%w91 z6jm%6XWL5CR6saagfth;Din`>F51c=i-&2Ux+;>vH{D5`{hL-D1Kz8g zP%OuWtfHEF^432Z!7;2Z*d2=>zLSO?6oR5xvJTNpyFM_FYCjPlh0x>YDOLOGZhqxd zRR*12kNxcaxM4-P^r^%rPc%beDZYc#Ot8X!)BmxlIuo@ZmgOm5u2gUbKMW__8v+NH z2Ga}Nr6~-1I7=&OwGh_=F*B$U3Um?GfTX6ctVt-=3zbUtRGnzMobsL2`x#K1!AtVL zI4kc+$>K#xqtJhV?Zw~bxI$ShHrQ^97qQwTpSkRP(*#F|!3M7qqdaV^cHPrS#jkT4 z>=T^5`5^qEzmegDQDMwqLEeA%<(ci3wX`c-Fy{^q`BZPw@O?Fj%{#A~*ju_XBmVJT zlP?c|waVq(?IOyjQW@1G?y_@9!!~iiPh|1GQ4m$@uD#FJJ2ayAoBF!IrpZ~{ zz-Ayb*%!rKDSfkbXX#xT*#LH~g_GJuly+B=CKXm^DJA{;5G|Z|0waEkU0zsY&AFkm4U<26Q|8t~cqVNl;ZJ}y# z=Y~D8ronqshm*(=3)Vi0GA)_QIJjLa=D}&i((=IoEqX{cD3*X0F7YeGONo*XSmi0< z7N=Z8(7q-Ajdp);fn0Y0Nv+7}PL|&C`9b>pKoS9~gfCvV~H8 z!sk1gFK}Ff*l){Ew|mlRcoO;JOYQ8CU;Lcp1-?5mwkn2TsY!&mdhkmQoJSfc){O?y 
zu=&PIRA9d3+>EW9R(E{&u~g6Wa}xg}^FRv+2N~$k5AfM5~peAxn31_!}>1WJxQ&Ca~=`@ z>ZHaJN?goL#kE*=DWF<`W?oMgB>$d2_Bov2dj`!`JpYIm)%>iRfqxFzh+ZmEtN+~w)q^Pwox;uR2ymZ@v) zdY=$TXka9_swk8t+rl(bKPoWQt2WacRhyRO@Hy6%tq$}jCWcdG13nimtf{tm#}Bg} zY_eYbyI&dtRd#xKpg=gTx%T}4(# z`D-@?)H-w(9nE_;B4^%d%yrz%RtPO?W=<&yXyAHRBinKd8BD3*b2qH9`b?9WzP`|s zdff@NBFm@GmKA#+5=Pp8QL{u36tYC!x^-HJiFM?AEA&L^>iRu#QW6X-@xQD8zdz(8 z#HQ(SmN$skj!>oU+nlFk$leyq;riiCb90y=5Wy_{4jNi~SS({ke6=0teqX;t6iyzG zCDvkLe~4^SN16ArhZR!y$=v;Ejnwnj=ly6?n>jcm4TE#FTSuC|Gkzb0xTgzRGUNS> zWo!IqeBC4hk|}Rpcnz+;m2mbn^HQF&|Nl~e6&WHxV1q3e^N7g@D3u;!qr}dVZ|Muk zp&$&zg=ooi9v7RM`gpif0RdR=w{x?ez{y$;qiR#8@MFn))tgxZ%V(w+!P2b zDy05@7c`$oeB&p=O>E*C+V>hEJcgR+lTDwB#XAtAaPA!3(S2yvH%CCsA*`)k zUJb{NZh7_zpHS~>5{-Bw9?&)Nw>?=_OCU5~rq=Al$>P0DM>jtm5o93gEsNMj>eL^b zNW5-sdVadzt&KBO7m&hoc8Vr;MrD~03=hWjE4gl!$PNDm4k8{k`c7sjTUC5`STu{D z3=ah%EcCxelyI-ne=E!)Q(J@cn=wm7CH9}!1&z=b4Vuk~6X3e_WB zHVyc=EBtPf;Sxlm=+b1Y;|{+HSF2-J@;%Gp>B?w4jTLsi-36mmH`xDy&#aBZZ!y0! zIYTQ0Z0*&=LF|TB$r4Sxg$@%S%8aPM>ve_AI@2ENInkSGA zn#o7x6~vo2i<`7IDispsgWb|T{^Lxp7IW-+oeAuJgi(MKsFY}F*ADd~Y|`mGUm%RV z&m@Vv{U_m1j4PHZ;;m1R&7(s1h6P(fCr)Ai+t&ioLeq>p5&zJ?6Q*QDp6PYW`Ib3h zt92;kc&<}ivanQbt#N0z-@9UhvgXgtZOIP|xWCM6F#PgMq7g`H0nx&7Mf&p_pFhB? 
zCvK+y_lm<-m!@cfZWU)>1?Li(2bTv~S|cIF-yw`M-U@i%RSvZlJcp zPen>ANK~9a2_e{X`%&4qQ=|QqY7T4U6GeaD-qne8XU3jkB-5FDafoA5p-ux^ZDO;I zHx1wc?P;Dnh_{&jyF~8S<_zzUBfAtlovq)Hqtag?MZ7kBMlkeH+N6;K{V7qmCO*k0 zN75LzHmi@4_NP-$23L~6VaEDPRMt7$PqQd0q*$l?y8U_j;h`V>-7K%9PtPE@)oKGT zDf3TX8Xo~aN|D~5b`!>-dW_OiDSnrMF`hvGOHv{BmnWCmxzmsupACxRBo?n`(iz10 za+0(5Q^T-`KoTD9H(#$Y3ZEoZUL!w$DFpI6;y*@5kr-mT^7FVhEv6d&Uk@K z(XR7ZbAfFX!C)COqF)9MwF-N*((ObLrx|lQ>SWid%1ar$?%E@}+-}m&OBo-pPdkea zrry(U*MnQD8#%2XP)}E1K-_#^&A5M6k4|J*uXZ%+HAz}zZ)OkwUchs@)#uq`-x0+( zyKHxTaOwHEyezPS<|IiLVY7b<7rr-)_wcYYOz?2k+!D&Vd!8kG+C->Ta^xEcUf+*a zolTFW%DlNz{iPQwE^W*ryiTl==Cl(G{T(>*ct-usFx+srHabSsz3YdC$@T*=DXwGQ zQt1BcxzAJ02s_O0vWR}wn}YjsDGgfi9+)BVI#mY=+`u8KRgil2e!{`$m?Q7^K9}08 zg}_~HrdHD)yxA`=Q<^J?Jz+7B9&JH&piUdC%4n#MyB<(3?7GGpx3p{*`@={IiB|AU z@@U+|h@9CYZN=RB+qPe|$dga?A4bbH0egW0!=#Y>3%+aP*K<(mmfl8vfTv zW|~nECg^3&T^>hknVr9v)-;+SlYq_3SMWGCYv-#CkNuTuL6|)3I)ZjmP*p1ND`#3U zCbcEAHM{h*OXG+22?i7tvDc;kJyUpNZ8rb@!O}%q;^(^%*aP9+doy> zB=wRgV)sXRZklSo$2!f#fL%@unXr8BC-Y^`C5DGA_(h_3-xRM!HbmQZhuV~0CnnS7 z)QHCk3ff|-jK{ntIEU!9LN#=|ehhW(+B5j>EWkhYgh8p$lH7@$CY(=*bon3g#qh5m zHJSKX^j4QaoU=`jB%2!q5AW3rRC!Q?dDrBL7|VF}ZwFOb=+xIgUw7tYt_YL}fmL2h z@_2qeifF1lbO%VM4&vTqe0P6?soSW|ajM74=4pd)z{}6vK>Xpn{Uel_(+t_NAzLIWXvB>{6>NR+)Fv#SNOth^UGT!$CqG5@{ z$*gNYg^({C1XJ|wG6n*u9s0Nb{Hg;~q^dRg>x8nynf@=mx{V}ZuK%my5}iiXNa5V! 
zA#|Wx7JfJIao|fGO^dEOnA@*~0r^WV(uGzg`G{FrZQ-p=Cm@AO zjPj4ObP9ok{Q4T15 z$9h0%8tU8!-e7GrtL%cR`lVl=p&hCvld_1Nmbk4r-YWY+UAFbvyr1v6P{#nH`~6;{ zUtDt6Y9a+q=)3+mZDhRQZoQ73X9%-~7AXK5?SlPvs?q6P?E3Ema6Ul&TOUQdWeIhb z6c#=NE}4=#Uf}h(0adph;1zxUcO~lcQ|cR&;U|E$775tj@&NWw5zun*frsqAAt2{Y zW;22WI#mrgJ`HQMA4f{Pru9DNZ*Poz)o*rs^)@x$%PZq+FOPwcxw%3v4+YPyBpm8V z)?%@Q2nEP|nR_gZL~w~>xoViuoV@qbv=gXbL;T}VbC_`tzXFP*LhZXSG&mwY%@pkY zKeRc`t3L}H!@d5v4z8MJ@FS$p82`9*ce`k5bk=$Gtl}eN&+nnClc|sPll`}YP)M%} zry>$qFYv|Y@24csM+r@8!C7)-g0SIgy><12Hls(oyEHB|_6oH>FK(MnnaNfE9=0zW zcbRtUh8!M|IG1?!I6ksFjiGOz+x{cZv+2XYLoNV^P>RJMV#P^Rs+Z`bhpMl#iZrA0 z{xz91fq{@$Ilm2CLHl=iKiMp2psug4p8#O(&pPsGxeNgx9srG*&9W~xuntgF6sxq^ z=c;sC&84$MQk$?Umkqm&KMx$x^qFX`A7+QGB<&dX#)Ung2H0YD{ku z+e%vUz}^)A{*MarMm;$Oyl(VXqka$dV1?Lun%|Fl<#!~}Ao23IftS zD=TYUaxzQV6Uoa0_363fT*E6I2Yh#XOI5N z@q54i@qQ;p7BHkd2~5^?78Zut4mLX6?8kONLjvzlCf)!}soba!u!u*errsh+N@2ZU z{Xwy+c})X@jd35RC5g5}S$rFgl!SC^HKmIdUSYHtr>66D%qRuRXQC!D$jq0Ru!*;` zsin%VVR%gbz=)yX!|d9&ghRA>m z&9@gZ$_(A#Kztv3Q5t$7sn7^($sKaK0%*01T`jG8t7JzqCk!@jHt>WYRkw8%dyU-B z&w+#?7@iMe!k#=k0?$yPPwwn=ZXdpxN3*cMf$i`RghC*Y2b%3m=XBH2*jycq`ke!d zva>=VfA!oSp4>?>nG*|el_J*xe6IOa<6aVNyz^OVq6d$6^L{TF6_rK48bvslfcUSm zuf`a|-Pt}kEiJsiKLH?^KES0lKlUoweB2UW9D5Yl0cR!?pMMX~U$wX9})n@kp#E zfPT#k=#5k5D4<|qi~$;BEAXy)-38MX;41+D4>gPd5&crDW5Ek=cq8yYH9naulTmvF zJh?^#5nrX`8s8?aXNqK}0p73E6=Gu#bn@K~88&`F$;b;al_I4ynlc}R2F|r1_R2h02NOQCN3Ppnmxj=rh)_9M7 z(&qKh%LoGc(I=W$hHaG6&AC7a9h6bHRHTL!1eZ676897CRbDdu467BHrn-a13->B z+;hcdv6y|4lKY;`hs*DIS4-eGMCfJ+{Pso+=cA9SuX&zUS#<;KBA{wx_yZhs>i~S4 zTq+X;=cac9=?=hUr57zDAe$ciJKRWH-3kfN@1^AB)nXIG@VOj_Ls5vTM>%l;msmCF zeB;sIyw5O5c)>tC2pN#DX)#6oSW@8AJO9tQkqTY4I8KZ#?Et}u`V#V|t-u9>rc6wr zze0*rl`ee$pbU`CGWZn|Vgt5x3OB?6d8tZx!C&yQ6!QAsq=ej{L@O4H8v?3*)>oF>;T9uWyMSZA8nTl zL?B+qZ!wTO83+ve1MnWdO5eYn)&N!OM(ry<6_r%HIe>6cgd$He1NB9mev-L~9I9PQ zE=qmP3SRqnDc;N5|NpcAJvGC%z=I9<7YdO&Xw_v=@7I*TKe@75slV~#v%_qsw~~+- zK%RwC1lwuu&N&bm(#QvrN8odr@314R>Q5lor@Oxi`8(Q?YCYa)$>e&xgJ3BOTyq9A 
zsn&aSF~KYURn;;(`hD#VtgtZPfsmq+4BDrIT*3Z{7clr-TH7RF<@1lU(?Vf>d}x|D%{(mX=QFk~-9KOB1Gz@qHBB_D$IYJ@`W|;mlCw@nt$We< z@j(7!7|^wlIF`FaHFr|HhHH(MW6ijU_uGSG5zq=7FEaR@nReWKD|9n1$bdD7&qKNw zlMQFJ608j0Qbmb@02wH>jCzHYHfYl&55Ow&ouY*E0VJ?pJDgau23U@)tQb1>`QK$v zlD4gY59z_w5KvFX004&VuEz}(&?-RoDT8Z{ib9m-Hh0uF1zr7@&e18L8QrRtlPLyr5n*V~re-}sdc|LHI`=#(JwrWZIr|}{hTp{3oCUFu- z^TEH73DvNnTQyls-2A8qW@4ah{>UehBoj+)3lD7GXIz~Y31Ie?=-Nua2=~MWd5}o} z>bXT`QLmABUv7X4ixT^*ra|I#WeCM-t6km|FBj?Y>e#|Ye;X~u?(K&|Wce*3BGv9I554bjgpxY53b#~3zohd;#>uftS3J(Hn)g%pNVo}Z;-3tM7%X@hpz5n_p`P?oVa+=a zjz@n<_Ol)UL>jF)LC`;3o}WKLECsgmN>avBq@Qvg^)*iy?5@R(>W9AP45F+yZ*NRy zp`jDlj2N6QG(GOdqFJ#edm4ad&@ILliNMcBS|X^WQgE4`ua2fw1}LI7m88=*Q<=h( zmpv8b(X9%C;)X^h_>xRUX+UVp`t-@Y?z0f!LA z(mMO1t_d~*ckeA+>#an3N1OeQn9W+3j*mlki3;Ty%)jsH!G9Ze$qP733Y+Jdy+Jsc#Y9rcy zcmbtO%q$Y*0NM}Gq*QKGxa~!~u5>y1ng`_#1_s*4{UXy``TpNE(5zg^_cpn<_vvyq z98UXhc6O^+m$A4(qQDYDhlqy$@Ov6&Ms~m>q#gXFP7Wdv1Ns{2+>9vPJ(Yjjm_I_e z>DPa=W%8bpvP7#`2CuBU{Z_20rkcwP(va|igK*U}UDmrlndfiJKTMhaoHnTE6{I*e zG9ujJdX5>7iG@|Hs&6BKijHo4D~5@QStyvJiRTAmVQc;4w^;xfU2(B!{R2FZ^sI0WB7A$tp+u7$z<-;8LHUMw;QjITSm7OsaPh0Wb-BT!Vl8zR$iRpK&_s*3*oBpv zbdbfnL&CS$?zEe{8huF_DZh$8!#{nic3e{7}uAhl>@@STeV>XK+0b{16^=+az(_;Dc(hIYwov2FEKEn^i3s^9)VCi|Q z988|;i4;3B`}sjFR0YL(T>yP2ZGj@iv9U1~y(@Qi<`Wg9~42G z<3MTUNK>q@=Ag-^1|ZugA}XI$z5{uwb&C+6CJQh?eg$e{H4r|pQ~m8Bzys>{^y{GK&+vQ*Bs$evkB?8!+uAWyGPRxM+LFvPRiEc=m?(VO0d-KF0FcHt$TOUkoD%`gBS}v0Ip5( zIo16?RJ~PLR$aI*EJ&B6gfuVR-O}A1Qj(HVQqn2i-Q8W%-6dU0H%KcX@ejVW*WTBE z_3=T7Hm>n6K_zMx+QLg_phzmcjt5+YDIzMhMq94=#-z@yNx+k}Lm;W%xw zggX#lhy9|U+JRxEFqHN5eK)3j@N>!qZ-mhGO5KK z^<+Mauc5aW75Wz9yXXf}={B8f?<2mNZu=$hI-z?H5ha&QDGL5g@yINxHARFzo^wFJ zjFfk3LdhU5?a2Cy*7o#>+jrKErDf;+sJdacsA?_I7Vx3&UVMc?ylv z=1a_8DwSDQp@d>obCB=YNJIOtuQ38HprO%~tC;{A|60rwHJfzK zEkGs+{&O6D)v2%4Gewfm^z%3Fg#}iJEpapZNiZO5J}*5+hvCM3#ypkq-MiLZ><98w zn%B3NZ#~Fozg2Q0aK@Pe%q;Daio(HwL)xRZAg=m#U&tG+EKA6D6Jui=12o9Moq+z> zq?xa3SZ(-T9Hm?WQl$7TXz~cgppYcx;?lfpyx`1n+Mf*5mXniHLq&6osN4{MOfUVZ 
z2Z7V0#kLQ(7fE8g>op(y>>GZJFRWDm-i{V(YImt=xUOnDRW&6u{E)JG`l}5mb#PrL zc+?yL^IoKzhi5`yBU@w$P+HS=J&enLRk3E(YI8=cYcl5<+Sz9HX>U!A4g4Cz3&`D1ZOef(Zv5& z=_`JvQ@q`H?fKo&tadd#>RWd@h1Ax)`9wxgH`{ynM~?!}hgwfbd;C0{ujJrMZ2`Rl zKe3d^p06#V$vzOK7GWYF2pa|5BmDMiW+P`*ST&5c&9Ns-QUYqBEO8E;Y+k25N}}hC ziBRq;$U%i*JkLJJrA~mZ0yBXuqbPZZEGMxXa}fwH= zdC%b(hu#GiDF(X7-f!_uXqB_D6y zH0}>arJpXOfNj2l09%GzNsIgg>%JpzYEr!e`y|>|l)VU)Xfj>$)WJ6xaIaA!3COS4 zzG72UYl%j_7lr6YNK%v>7%gX)2V_Zch+1d~(@8hv6wiQL#}(irm49@?hSNcu{1d_R z!_^yYBo~lCLQk&j8O@ON0EC$byc*PIP7NnX~rS`N;ex z^#$8W#o3yf0>|?uenM}E`}e>>rPfk6#+B`Lf`>`1lw_C1B&JSvDFZWZj$Ke#s+eU_ ztQ2xMj3It~7Cl<1{HlD>*K*d#d4_#ryHF1ULptL<5s*g}T**+tZRJ@fNv zENi8hJKz=ffzZj6z_kT2hWtWrcMx>g3e*h`OnIv=`cf-)3&vxW(!e(gkg0=M=bd4U z^yS)z@LfwxTy(Iw=cV9=U?9qPSjT3()WD75BxF&|P8Jqy|CkmJVS(f>lz0SQTJ*qZ`4(Gb~w`(G3o{mfwcQCCwfiZf?N|>kxO(MdQHKSMflH) z{l3I1K=mLY)!o`(zE`G!)+fTo2>RkM47xh|Ol*h_B2}wk=#EwzFxVza*{K6yp;i%i z#EO3NI!kjBmS7WvO;O3$ED2~=C&)g}^%iLfO=N_&GBtF{7}Uej$|ZfPF4OzTB$Bfg zV`+-^`ZdPVj%=awmFp(Ft5ybLl4jI9ZZxqFG!3G=pEzJ7Q+pDZK%FSf$fL&Me@es# z{eW5%h}wKFEit8a5-x@>d3}iv?;le?nNRM)V(w#rr&C_lDw|rdik$Rsx+_dOPCZ|? 
zvOTbPq+w?3Ee1yD?QL2i5!Dbf^e!e8a_ZpcKz$?OPOoQgfCCXXnwZpDt#Cwm?-NNB z{Z8S2qd`ut7o4lPM+^SKq=})UP|!oIE)q{IP9+^*%KO^LQs3Tf1bl1#s~a+eE%@U@ zp`K__;$6asut+x(^tHT6J~F>*qpiaXEsCRnunq>`ywtc4CTHEj0+3dG1HVlcMAf+Ij zC_)*x(@2|ocn>a;AXDp?vY@#w`6QD!WHgM6vW8d*gQ5+!2xKI&%>m@jpuxulyWngQ zAaJ9F?%q0zT(!m)5-!Sx%-HVoL#9v-6YrncpW!Gl2#MI^Z@tCNDk&P&UF_VsCoT3BnHiUVEXH?cmVI)EUEC;kqFIX+f3l zZ57SwjkUx@Xf#LYxUsvfFPI(+B#@*2+kj*NK=i~$9%rZeoioR*N?&g086n%943qnxe?i{sv6Xj6M2^VjXuNEA@xJEp%6;P zHX7$_&Ef3D@4?8W_n)&a6&y6au}x@5j|y}K3$N^FZ-ZL)vi92h@zk=5R}^NGgpjGU zpc-O9<@vI^kdDV#dG@AHekr+Pzll6X);sh^6ItvCQIhqkOm$1AGEmTAPAG)g7{6;W z0*d$Jvr#4y_)Pcv5#4QnEW2OKJJUf@2jd4mpZM!G3IE4yGmMu9rkV4deUn-M;S4X( zI~gr4;tMRuR&H@vRB41e0`SOpJv}|&K|)Fb$-6N&8Go>TqN=Eso%g6OUK|+tXE^?m-Y+>pT>}L@{{?dM$z0Jt zy6pS6ZwLU0%$-9eTvCaJkn&^d4!r<6!;^WxaCAClR4)PcsY1t&2{N8<$+wUw#~@5H zMpivc-{g`QR3wl8w$h)18kxi`)saE3q-1oq%JvMIWhn{aZVl6wqlxs3;Reex{>8Cq z0soVqdJkW&Y4&{nF-{g1mNFSA-5dsx34U#b(NxnVz|3-&Ot54UQKPz`gc;}MmG<1h z*g07V@_iOu_y5ooczFJC=X5>VcQg^>-v~8Jg0iNDbG0Wb3zu!!LE9-EC(0VruB_)V zBi_*h;Mi2NdGSdRmZV4>evk%oevYz#=l|+oe#cjskp@p8A$QyJQ%f4|axHB(1q>@P zG6bttxZDoAknD6>dZHFpke?I{I&$c$IXsGRaCMuNj=SQ;(^yS{kFYI?Ac>NZ1VQoC zN@XGC>ZShFS#F}0vt=6e5^1c`=&?Do{#W`A12`|Bd(X@i7&`MoLzzo6E2uJhX*{x; zptf}mqiaQ@E|7XRCeb*fW{Z&&f9bI0czvKUkM7svdAk{cTe_(oCkt)6*&AfUt;(hR z!$11wcv0z*TS$lmR5*5!|5}oX5q~SE`tSPCgzO59h_8V7c)5OGf{1&$B|#e<%UX!n zMc*|4RnjCH0f%&DpIyoLON*@5vqeBi&zsBzZn)zf`rnlUALd?)cr z@1gjEtnRc)wr)gdQ7IM^z%Fj6?tBurUb;=%MEJhWGyjEw+Uyx230fGZ6}? 
zXF8*yZv5=u|TLK`Uv2;;7B(mey+J~qmJbgN{QBj zk^2^4+H>1K*+<64$K@Z^pC3ueG%L&O+d(PBoYI~uhxD8Z?RQWF8N!Tz8gBZ@oh}`R zZDMnl86FARK+4xN?Gk&r9l$Rce6DC-25wN#5J-+=p=_DS{uhPxJj-9LFH`7S@8=HH z;tBUXeCM|NBeqJFpQL27dPv3cnj#p|glsS$0n62H`ypFQc#itL>nP+SiKuX*ui+OQ zV~{hSTQOCf=DqPs^(~XS;agb*=t*+r%@GdhVYaCa>uL$<=ntE?#%4=L46SHU#d)El zbeJ43<$DtDkV9RL z@ja>p8@@)Y0chG|2Af1~!bgq~i&T?oQjsBE8T9Gb5ms!NIuZe2jV(c3OlMidExvj- zDn1(wP@K0&Pe>X^73Q3n%M4`J{t#{kdm zoSYh^7cV=Etjh8R2!62>4uC^tuwwTZP{`!dhz)}bG`In}7+YlcF?Pe`2YjUWrxR^i z|L6jeP=zWrLbB0^WoM(C(_d%?ILiATpob*PreLd5ATY*qcas4$r9nfhH=L*Mue{@W z1yZDr7~`ee07m!lughJ#leg)-am6F=rt==D+Vh?^>-#=&1e51ynm?igC^3rORidhU zrNao2_YW-O{9pWQKZ<1s_?Z$I8>8T%xczq89yMKT9W2TC2PS0ZOIH(1$|0*xqia_aEI3Rp zbx+H}NcH3|hcgm|VrJ=k`bDLKCC>Q-6(x(ya?BkCm*ixUK3t3`{OOcH)-mWm4bbZ< zY)i;!L-4DURvAorwOdC0R&|A*$#TdK>DE}`pIts(mjx!v)U36;Auh8D+@*GvHa|8# zUH@LyicN>`$~+{7fLll)LtS2}lBjCl@MJp-;(% z$NsVH2FPFCg!rgq0|FqE&yd3c^>CVq%=OpdoIoLwJS?7u8VcP|i3=Tp`j-yS+Z*-@ zAojHqwFBE6lmwx;N!)b;F$I&~9k%B<7^K?wv%V(Kg+XSH=F6EcxF}MkdZWU)FvSL7 z5vk}Jw9{mTSoaj>Q#Ca$tuu;h^1^HYrHo6Y+U4!FJ0cm)O*g|~MleQ>&k%XC-qoc# zbKz=z#uldr!K~CUS%|kTV~Y!8h&59tlAJl-=n44#bbp+96p>rHf6xa+(kBHnyM#BA zdGV&CW6G)Qh;np;gs8^dSL0te>mpI{rL!<^;_O%3-Rm!t>EC<|m*;mlV&Jk_4@sEY z5=_+Va5b?Y9Dr%4TwH7Iwmm4D_*OYT*mGI) ziAkaQq{K7ElF5H^%1HP;GE;f?cPdlb^o6CiVkjWBLlTV$tB%bGf z60qi(F!MS#Q8{PMggwskPa&`SHijS3o14jt$W>MebYH!JQ>$`sJa`g&a1%;y{@V;7fV^7NBGfG0M? 
z#02Jph`@4Vq7~O~0p5m`s5frM;KfEBa$Ijp}=$T6*{bHrs?r zvGCg&Ug)EAizE^Gl$C91Rjfmr)_Y*BO-;EvHv zG^$&EB}1R|xsDbZk?SOCjeCVWQP_D5!$YAaCMWM!Y?lDe5~Es1i*|%-VT+!-j+y%Tq~%y&D{uIqsz-ssn_K6)umoAUmCCj3ZL#&RkREkNgJpFE67balLP-@fc+B5>T%)+R|4e#7R+cZYD7?6sh7fiuAmU9(Nr;a8b7WY9t>B7&ScpMT*@92ZkO~OX?Akr$}uUrLqxi zD!Cw%bSf5YN4fd*JrRozmrC76byv-auaQ3aXZTZCQ?w&|r$OIiQfAW_w5LRH$fw9c zm7W-w$+9eWcjZUlkMcwB`pJi<3K`nvjr|Lv`{%?DohGC@ULeD=v)ke&sKp?xkw(jj z10+cOpaY7&D1jHa3cBH-2}&gs(+&?*@Q&xj?E8XQ#HChk^+ab@HJ3axq8$_&!~Xs9 zc&{b`rCf+K<8vERMUuHAPLUV+L1|J z0xgzbFNq`S4%X2_b$};`xyB*+$m|xa$kZKc!kZjpsrk(#UfzTNk}Li9Tt7+!OVocsY0_KejA*OuO_~~zbNY)d-;7>OZE4Zh2X-?y**%o_2x^-{{7RmCIIy*t6 zPeCZWjw}c7-uY(VE$F6ZP=GDzHn~xL?`{h_MLi;!v|LwCD<^K7`1ulX{Mfg^!hzT$ zYNukzBv5{KTPHRH#!n7I*>pwh4O!>6sX}HW38zhPbtjNZBv^LZ4??6mD013olbEHn zSDU^at?4fwUfyM2{%-!}dBVzyzxXFLZfEozbeQ(P2d5pIW&abee?}#ms4VI8Ac&G( z%Q$C&9%c4$HfcT4vC2YeH+OVhkf=R7Ux}MBmd4G&LODUSN8TOFBq2~oQ8B7a-9&@% z?3k0p`GW}!TOK9OSs$%I9ZuSxc|aR#;AC~Yd<0~GE|w39;9ORRlfy5KF;?!8FvFE0 zvK~n5z^Vx8Qp=TKn`mt-;l3gdKAGXQnxoOsEicXyA{Z09>s_rikmf` zh87Xo_o>!E&N;(>)`8t`##tks2;*-Ch(0?U8ez$k=SDYF^L=Wq&YIX8!2-ILPrr5| z-2i8;QmRUc9I|;z_^DbP1Bo&BHa<2cCg&uF#*u79k~|*g6F43FN8+gy@j?R$-ci8e z7?|j031s=>>l*Q5iWceQ7QnMH($mYOUe{-^v$0`+H$w;)XKB>181($P7yQFzn8WVokMQ489x93+p-R0vk8ov$G?^byRV2JC%AKBlbIi zxVD_)HVQJ+G#ptE(%F_rIPcwK6`37qK0`7o-j52vTAAAT;6Qr68f)(WrK{%cgnS)qn>(B z1PqtF4Aja#J#%LYAYyx!ixeAivE%f+MS13Y8#r&0-i zN~9zXCFcpGu;=~lD6m6m(i{fKo5&_mt7Z;5io?CbJTHhB{5?|`PO-1aJ){gZXqUoq zSv)?onmeyrUzOHriJ2uP-0JRuX00RhVi(v==Q2e=iH-e+&pEn$D8QB_PJqa%Qs@q2 zrz4I|BM8mQ8x={bI=Q<<$}9VyZ_JBKST$VXPTN*oC4xV|V6DBYP`BJ-D(m}h(g)$Q zYLLNO^7SH*GV*jd$drH=#pNw3GP2VUChmmH%yHXUXCKhuF?&2D8+xl077b|R)y7hI z%^u_{W(D;~!rAC{Q^8@s{y*h< z`K~U+hQj_|7aW+UV4yUKrpu2H5@R?&t+hQ13m`*wzpu5rdgoKY>Ab( z{`tLDwS;wPBUZGi#cAI_olRFYOSI??A$5@Bt^joArg(@RiEQp3VEHQ1xUgEz6d-C= zXm1{{G!qT290E}=(MincAll1cZJ1-~BO^T`bd{(u?Oe$gBaLsZgNgW>%R7`rx)v*s z;=PY8c|)s>`{P1|4htTNnc%(wJ`hY+ zm+CqQI!_h+OkBr24^?XjUZLMm&yfg{FBhi%(4iwjZmqjdnX 
zaRIuu$)oi3Gb2D;_2$~wGYtrNr~p!INfyN$;XFoeZmn2P^(Nl@BrLXxkYNt$YFd1| zpjW9*7I3Y)D$YaG+|--k zrKPLqHwen62CsyC_x8xcDt|1mWI)o;JbCr&+*8H>$KUy4^Yny+(ll0qm58>7dmI70 zNiaa9MHBg+0H~Lv8vY_+lQ4+2scZT?Iy|hro1>-Y0ION4Yqre)I3!(Ht2e_(>~TOQ zQDZ6c%#!3*cTx**+yUbuyL51mWsi=46z5usk}v03sZKLH&j-`_WqR@R=KpE0XtuLt zwtjhDO@w8vG}*qBHXpFcUbQh5ZI^a&Py+aos#Hw?)xLSXutfkbV8_arEDwq$KPr6s zRH7z_m}(B2y_RUHm}!a6@M5cR@*uc${T*nzN|nMkqyE@#K*7R-PL@cHVehVwxAG;> z@DE-`5uH0CbgPta!3WD;X`zC6PTj4&2%{6fs>lN;?cGk^TYjw@6h!DF^SA8~OZO*ZsOz z&!&Ak<8uG=zW%?jAJ|r5pWQ+5@Y>ND=7Z0*(8%;VjPSniH1dfOOt00av)gGH?|!ce}lWmPq`M9MiDT<4hLDSgt;hj+?&FKGDA+Jz5~M!ce5M^wabT!lVS0G{cwkQvp(H?W zq51zljQ?wcPDfDwC<(g)ZY2$TNI)5V*|78yi*&;DNmh@fn|eX4y47M;rbon3Y=QBz z2|#0OL4n(!TaYTlNSO=yT@f&G4G^|-<7Jw8Gsy3Qp`u+8$A~8*-Q@$5Xu%>`-bU7e z%$XN93Ixb%^b^mUq}|{PZhmJToK+I?3T;l;w33A`cV|V-mFA24Ngg|lIrI4MW*aEp z)y%R1saGRck$VJelj{H2$Y(}83~-b_DNFu?lz&GX;fr$_fF0pdNG;;zGctkQ6Qv~dVLtjj*s`RMpf7RwRr1Pb z5Xdgd=FER$3635uR#&%gR$Mp^wG`e*xyQX~*iBR29}MtBgNlta>}*)D5&HO$_V5%2WG!#LNTD{Pcw%Yrn{ zVqsS5(*)K6HhnQRX>=M4sG(sFp`DPBK6S>o=gre;Ejh2CN=MG8CWC2P=^kchKk|GX ztg{pOL?b^vzmc5?`#g@4Ql=TQym0J`f3Bn?v2iGFmSCU?TvfPD|NCtLH0D}N{aG=? zmy8K*kjwU@*YtO8$>Zcj6x`ypC(eN3w94EL5BM{j!$7MiP*PHs&h&uuj(&cbI@>c3 zbAXnY_q;0u4QejvbS0f_&Yu!Ze0*$l<_@6x_UCb^B!) 
zfs|1Fd#`&V@S(W@b>+MXgO;gpDkW8h@eJ>mt#GADC7;)~USE(8&asw5aTE&1KN-N}xnS48Jk_jDLX3hWR0ae|k0tnP4ReU~!F|KK;_%ny=!$7rMj=St|GgSrp-~>& zvG?^>D@=H)`w5|AakRN4=N0V!B-!UP1JmboEJttsX#WKDoSav|aDOl^uHzBtg3TCt zzP?S(ekPgOpJ+x*im^0n4K|G-RVIgyim`+pL^tKKjf!Ymto^cHc$GQ<5|q%yX7Eip z?ZP(l;Z+WkNibq!;hdCN>Ik?+?i-1Wcrr`A)s54qmq+hWa7l1M!{RyiLjTnIQJ+za35iwMaHmrC;k&V;;fi8X#VOI=oF zdazEm5+xpnXsAX!A-WbNA&|&t#soYg%_KJYNSb=8kP_W||Hw#0AX5!sx0s|c2E4DrkGBOi*uar)Si)j=D5#u0Ecqzh~HbikVy1k6#Tp=#ky!y zazqYHXOzei3Y}q9P-gx|J4%K;W(PM^wCq3@%jU9M5*adca5Xryn#u24rl;dL`sX_d zW@3MwEHB-ow4K#^*Fv4>A&($=WSgG|7~9!W_UQ{@FzoxO2`dq%Z&BcOviX7{?nub2 z=Uov<9?*O5e*p)Qd0H|qc6T)Q2wZ?_%5H^z$Q=IXZs!9Q=y#mr{Y%f}w`t{5Q)Am# z$P6ji5K1zWX$xte=XNN=X}~r#Bu4`m!je=sE8E1k@jcs^l|%EC*U(02D5|4uU7lDE zbxs?}+0i-q@E9r11f@i+G^&iozX<|=FGc($;EZ*gwH>zmfj1}yZ}jk$WM)+_1iC4k z3#Kfa@GLn!KcRXsz=y)|4-%g0WMWEtjOE{{NvaIMVHPJe1|`lOaUUKOh2y% zCoI=v$UJ9{%j5O-8X=%IA_tQ3tvE*Px^`+^Wxo{>!S!zT7N8!w{C>F6bJexK*PD<~ z@MLDQ5(t11?3TAK$CclV?H0Cwn^VL8mtUgH1v3fc(Q_v1iM6YgI4Xc4i%L_EMa5$g zPSgNcoieM0BO@!T3Wn-S6U`;%ECF(iFa=p5h7!c*AYkx<@I$^N>Q7L-mjkr?c)mM? 
zB>AIO0E4y4M19A#0h-;r2P5BMU}5DGn{+~!r1ij+X-bSpv$qD2ka>AOC&v>&d8kvX z(k)Ae{Q)dE(U@Q@1?}U~-c%QzY)M?tfIQR;ve$asz4L>TVEX{KCgS{d86D< zFN5=L&fN&Fhi1tG#bgv*s3P6ocLSZiV_?mg(b;)u5-FWP`EyM6-S(YEQEHnlzp7zo zR7{p2cB*DfT3CB#SD*711o~JHO>A2dxXHB4vX-sAg)eV$jY{x8gIaabvE>@XJ55YS zCrK*D<}=~R;7G0uOt#93*}!Q^O~6J%+^yHT9?$5+FrL?r?mH zxaRt8#N&nZag~ewd`_m&99dofN{BxbtR3;?AY=p|s!ZHXjc5(10pP8UFrFrycMv|P z=saa94anwNapagQjBPKJVJA7U78J-;35y5|ArRWlBLRRUL#s%mJQ&aALp}NT>lj*@ z#*@c?v315lveZC9{L!d1KE}#}M-YUtkRtGWdo@C@z*_XBv!FFnmKDZgT;g|w8UazZkab<@HCi*$JWvzQMy@xtC|O;rQx*~`DDVx!3Mv27;Q(CJ(vbKzL% zK%;hkan0+bF+w%MF&mpN{qiyv86QvioFa;o|L zzgMcooxvHlt5av`Ptxx0Ec?4SUmZcH(d-xc02n_^4EsR&`mAnj4SdVR^^#Gp$bUC^ zqsiguF-c`;WT{EhM+5~fr*K^vHZW9#5+dS|OXF?y%SB@GDqZsnRd#rIijF^x(w%=4 zVTc3+OEkvNJ#P&|robhDCPoRb$@1SqKiYOycB8STZgIATDrUog!paJk`5WWu9LxdT z)ZX;9nSP1D4>l(fV)f0RwRS#ePi5`n&;kV-=>{9jZN}7cD^5`$49sGQUkS0G9#=Cz z6?th==f~G^olf}BWDAOhtkS|60X3c|Ra*dwM6GQuejKK^*60Vj*2JJ>rF)8R1zYqrv2 zx?{H-_MlcgwH{bpLsuOB9(IET6Quj<1sh z7tDUUwd53aeYUe?0Fg`-7E)0dFtQ=S_saf(yOF!97ZxHXA$0)fHDRIh(bz^_taVg z;stZ(0oL&OA8N}U=)%BA+UvuM(0)pp{~ksKx{(+Z*IVGg%T5kj9O3v-lfqsf^XC^q zLDv;=FC1EpOwnCX5ZU!#;%}ZG1;!5#haWt6*@QlL_OMqD_cJicWb>2YlX#JvroFG0 z7%9sj-%w?XBv%t7WvjNz#kPPY%RL_(6FbP}NB{F%>wIdoA&16Pm$xIw^n-yI>Az&d z7!FI!JfQO4iopMNGA+i9g!6k)hvEhS^GY3q+No1^JRL!en2M%S1jVs~MP z(A-`2SjRPdgPG++%<8nqGv}aw_AmAU4E#A;WHs(r=DU89(X465atFvzXYrOc*fHtl z1aKu(ua|j;f7QOW{M;Vo-{2hh2Hz%L{wlg8Fs99e?6T!Khm;dF)zCYrf;q{6bXJ}U zyP){PC?B$7rsh%O`11$w8#}8+Mv5{hHmukp3$qSwSIHXHxaSLA!QPb}3hyQw_Sx?N zwrTglBks35q91VH0eAmKFsjBOb|@X)%=KG_-+2TZLnHGPaYb`Rn#LSSf~iSwVKaJYx~{(RVK;m2swf+5>sC3*2dcw}|_;)MzJv+76b%xg?XS(;;pu;c{yq)nta zGJjtp7(JPy)W_E`YSm<_t$)M~oHgcCd^3d}$6*mx+leuby_ID-gK1`WX0T!OLWKX8 zX-KEN*N;_KFF1^qo^`QSC>c;nb$t4WX=%pM@P#OhEuM5E9ic=qYb=SZv*u@gMVp)T z2xj}=Wecv5A*j?>(W|JKL+jiY>pj@ELEXIL( znAbMtY(6rfZ=gf5NQVI3X59g?eF2oQibBtCK`>TQY&COA+l$Zl(>N0&=hrzdcyu zMEr9BmaWy~2-9jYm^jG@W&KO(Je^VMFDb=In4fbF58qzRsqKdfC3#$Czc}O)2%zFD zdAlBi@o-DpFefv5S{gWha+#ytyZjFOR>MLl-~i1Jwnplytf1`dZ1OoORN@HIvZ+jz 
zPCH7kd)h6nDrZM3e5U^<8dr3q*U((#w;KU(eiX{#oO-bP-5^Nb(ZMh7IVzgDFY0b4LDfz4*Y#5(d}_kHH}b2%&!)t4`=j$v)HUsi9g&^ zI5L$7?AQs$lHdo$2)3cO?7hQ_n+ju7uvqjf*zQBv*&zi+pLyl(R5oU%WH&$`^2RH_ z09(~UG97aWHHQnOLnK-g&0BAogR|oc)L$eY?vG#kBM5bl^lVUXZ*Co&iFq+U3q0|h zo-C;l8FQuAI!qC_UtbK}ZfE=aC^eJwJkNe6?OtCm<-8+^x!p?_E#eKzLwO5c?a$=) zK>a81cs-i}(kiU(V?GyhvtG@ufJk5hFnc;Zizpu_u{<`C(8e1|og{=tHqvs1@4=q|w^4sl|QD zBYytAeFzC@@*s|{;h3MS;}1OeIps38dXzb2KDLk8GS)OPof4Tfw_nG7rpBs%Cg&+{VE&JJD9$C%%eO3xD*7`%W+}RS;YnB2F zrDDfL19&Syn#>8XY5yNwBpcy;EagRi35;WS+;$mK@z+SHZ_y3K5-kai7GTG?*}Jrn zW_#p5J3i7&^(^g5!WgbULDRs((n!mQtYvi!zYKOmQ94g=l_?ZCDG@3RN$#klD6B`c zhWXE<`+Sj7_gK^&b6x_?3Mpp~{F2nlxg3w?q>Q!@15f^8dE3w!bCB;}-Ve95SCLPP zx;#UHJz(Q3`ezD{DW_$W?=?r@$*&;a1k;qq$5>po*UORVq@GK0tV7Yk=q9C_RGJ3`|R#D}~FG zCk*%JQ}w4Xd0a`v z7b^u1nxYDwfLk2;j5MkL1=V-ua3IUK2czU^}hOwC+dq-c<#kl*K(evw;;24`ZE+tt;k6IY|ZU* z?JILC6+8cjSj$M}Hk9@Ddn=t}mF*c|zZxA){C}UIh;SDI;6{Hfs0;kB7GMgG$392) z97s43oM%>YIQW715{dD}A#^xg>+frVkiu0h^>@5fn0AC=$lP^GHU8EjqO7)~*<}oN8qaLP~v|J@EJ|$7DtlWcx&Atosa=#!{%*-a7}noStAa5v0Bq za-u@%=~0{*=s9-%-TRuk^BaciS}5xq0sCps(EZ8HQ0_GlFcZDz&7XA)PLV%xixnau zb?*QF8(z{2)sWhnK_90O@ozV}mjujvFLP=0U1+GQ+P-1^d;CkT${R${NbxkoU-t;D zj|O;E9i4g^ua^T!H&CF#OTz+nfknl`9%)uVE!W=9JVIG4PjvwW6KCLwacJljl%i8_ z1fHU)?UJt=g=Egr7&Ar12kv5*oJ5bI2#_s=JYLQHdpY1MwS0G@mO*Iw(Y%=WCz?7( zr`nqzWEIa>N;`YIxjG#l+91BDciDbClr%NT=>Xk$&1XGS$F$@>RH6xkB~?CbI5af0 z=SQv>EEO{;3!uq8fv*Vx0ihcpkjK*~f%4#Z&d1GQ(1gMm9+ME+LQGMDkg{{)SWHnJ zz733tQC2V2MD>nM#858hLVsJesz=Ex(yfMY>lg~O4Oy+SrbM5~V;b!;Rda*Z9pLkj zWzJ3*Kg8S~f@ak4Z{Nf{Jr$ZKm_Z^#=ePf8?0*aqUygV>V3m^G z%+A)Mfp&^C^C9ECEt{Vr21~j;&ssU|rBbHocm^*5jSXg8y*2J+1W|a&5Q$X3*G&&I zN5eS3J!fju{tq%|pS@J$t5Pi75Q~3*C2p*%Xq_fkgvn+zf|aR2$gdwJ`ta~@w=dM| zm%8^hQ$+)VEUCwKy3x=!phm87+E>(b-%ZqRaV(T-W_)?6h&wh+Yr`Uib#>FoXUF%T ze?e`YomIDXgWQK>CTt=-YoR*BSz&jkgsdEH=3RV4Bqh23T^ zTd|2ds@oAQ2nb}j5?JXrDOSpVJl`zQsU%41cwoz;pomS3f3cGrPb4vFU-!M?@!mtp z^4fqx>AF%ZUXWXFM8L2Y-`>PnMo0Pbvd{KlUpu96kp7rz ziIE?r2Fx)p4{$%u%&2~mD2bIu2FWna&Kl}vkk59zS4Z*si8xTdCHXuKlmnYTsVrDB 
zYzydYQJSXB7Ryon;D$Uw{Tq>Mmt1&fiF;v_oDtc5H{rZ=a|1S z{v{o7P{LGK(uJ$)WSyi|qs(abWI1_fplhAoj_@yuV3^^W_qz*?@+97ylYa36h7Z7Y zl8TW4E_71mw=lpr9Rwn-wq6{SpWFcbM7{u$Q;4#ftQV^eezvf`5T=8d3k!g=YS!`C?Ld(M zg#T~y0I!zxF&vk95NKg9fGSc?oWTnI$81F4B6yPa%Vx=?A&2}ofHvI*o)c9kL*R(O z<9A^I?3pz?k;_TJX!LP#NC>?R5A)&)p!a9LGETrSQO`PZ+ltK83>hgZF)I+a@+1vz z_i(#NCN>)^O1zH?BvoER-{9tGX=ypi&YH9I_V#AMgT}zdF6h*VdwGSC5k8sf?hv_> zmVI?C&b|2W%~Q-~O6``y_%T{v9)TYq>vhl5fee;Z`65uQ1)PEQM$_cyg5N)RwtH|6 zPG*KD6T8EBwYe}lJps!YJ|8{}m`N2$e>8tKEY!@0wgrOO0C?5hz5)Dv>%H(0?%6R_XnC{~ef_gnKP}{9$u=pMj;o(kT?2S|2dJ-mHDC#>6QA zF0gI>FGqo(O-Og}WUb>Jg=9oG;9=%X^pfW#4-XGXd3oF~O$(7sP=L&R-Rp+vCGiK5 zi$DvNx+Q}XePn)*>F68v|6c?$Qp8)t9I zdiVt|V9>I7{3_47qODX^k_$Kw61C3znrz){Z|$#7f3{RuQv6$Dg?ODqZE01|ToJM< zzf5j^ap*bZ5mxkcgB&<`Bb4_g^QCI=^%hf-APj31g&gM-Bow42CFVt1Pm?gPuqbzA z#hH@pNMS}maGXd1gM$Flmz2ZU;6lrREyf5C_McW7O7Alva{xue&Xs@$d|VJLWEtd{ zMLk=b33TU|%?Ld}`N1Ga1O(?9i@;#uNWgv_NG$sPw~>J^$2gakl#a$spVnAJ0v-8@ zmORYg_c)J+j%F(?jGRkdH|7V*4Cd|GS@#4Q86taX96$TnjexuA-kMYARTi?-zH3(< zi1&1@u<8ThTi#s8{Rp@i#h>bc+Gm9#T3{=XIkNQb#S+p6^xnUwtcKA)@hMu?5V-z5 zSx)-9PQ?zQTk1j3&N7hERLaCt*}V^qelgQu?fwvZ^F@q?tqOqeOwbDQ@3`$wJnw-! 
z)%Id*ASpQ+_ewGAl77p0pXJdn&kpY76 zD`<3I-!p`R0O!Vq%1V#pg^&N<9%ENmbb9@QKTKF@?E9MuMT*W0Bfwg@63MOmXZ7=& zRoh1rWs41Ax#eFQq;pU{y0yK!Hz$YhZyT-Fe!hLaFZa55yaByN@F(9r{xDRE!mh@| zn;1Uc1k8l9TdcHY``B5Dpa1=lo21+6Su(pz5}l%H?3{rBg#aTXoHqj=)u~m!%uX!r zgj%d|SYF&gp|doRT1QL>{J%8G!!&|o~Y#ye0-LtD9Zq+Y&SM zYJ;PTLNjH5JBL)2etqfri0=CehV&H_LW+c{Jl0f7pwehBJ(_Q0Hav zyNdhDX5IVu1lTaIsr6WYKQedL9K)lbnA5q+ckq2V3Ho~081P1b<-iGvpT+f9^8>h4 z#EXL^V&lm`s3t-?ICK;l)F4UZbpmSF3t%$*3E&eIZ;TA`+F5`_wB z93xpCN3=tSfI9vXp&nkTRc)Y{a|Pr^F5rD0QZH2T;7eT^zR}k0AK(}p ztFO$SZ2od{-9)5QZqkxevqz`KIzeaP=BM$sBs=l$^Dr44)c?HE!0H=|J}XN(a2zef zaD3W^G5bA+44n!Jytp^17P11eb_Q9f={(b&x0~=|Yc)MwPj|(S7yH?N_mdcz3QJ@Q zWvixz9|PVr&A#Jq;wsKGr^P7wn3kStJNEB-^Vt{kjF4LN74$#y8v9LAT+2cabIdrT zG;vOxJ+0P4@y6F^JlzsHrS{$|MGaE(uxi)~Iu@CYL$oAb*RlFe6PzxMQk3J9xg#j3{+eR1%w#*m5qw3Z@t2sAS_2A}(<(P5h!tmvP*In{RbS@pqYkRAG81XD`R_TUMG_l|b z2;Xs5F$ey0S7yu1vLULT)90tNXFc5_s~nt&pK4dHLQR0`tmDtwde;)NqC2l_Gp0OuSwh1yAF}ay zYk#IOqCf4`O|@}Z_MY8yZB3|fUAHJr61`iImCDKv8#g0EN8x~pt6m*H_h$^1R7z{W zuiYD9#z0aKtEIWM!+oGGl+^y>)0|k%I782Rv6{9+8u^G z+_{=Z^ExoNr8tC&WbsjSY(L};r7#3is#Z{fycmR6s-hI!*0Cw1>rXU8QQbU^JJaX2 z6O?YmiUcZ9^mhok+F~CBInQ`bl-Lq`B0G1Az&Ki%kH9_mfTbxbVlUk z4o3u1pns#bTe5z(?#(*ow@cCUcjkZeon4dAq7vXm5)Bkz8a_dYZ!VETTDGoE7FTmh?Ku@9*QhH0Lh|) zNJ}^a4eCpy;k4i z*Jkm%ZAV?9U{fc@k&S>nyIy7)>ob%GkYs$oBp|EtVGwje6~{G}_z&_fi|>p=hTgtI zvTB?VZqQ9br7-{Jr*M8Bw~}n6DjEt}E#W4V7kOCvI~0H9=*!^ebgzYZF_0)Dn9=6j zOSW-hK6*nYPnshaqOW=s*c+BK@jX6$RT24&W89evtdL`fedICTj`6)9kKg-BjfJF( z%_tbo*SRZRfh0N-E*bV`g}eFiPnYng+!CUYNj{hei`HvwK;>R>M=(i}nc3=|_~SZ# zD0PgN>y&5W$zKhWyNAo&yb!`VQS!4YO>M&@{$rU;qk^t6KcVNBY-`rUJ8`t4H(hgr zpV!wZUu;|K>rlHbdYYGO;@&wDXqco6xulZf@`2^%?{=9Ida@b0rFdOhmYZr%an#o? 
ztayrl@u}#z^I2S9hOZc;Qnul6Z*WtE!)BOm^KP;45P!@B8*c{8 zXrk}dNo9s#M=USv3uB&a{cDQ3#uC%(Pp+Eu1(Qa5&u44Zt7ker{$@bHEQW(o>N� z6f_Lwq_{<|c-iTeG?hF)UHbIAL(xWQmmo*R@#GcifS^B``PCLT*t%hiE-~nrg%QP# zqpTG9Vw&;##K+!dp7lXXu6*krp*0$9hOZL1p|{nU$aF3G3!>Ffp+|}QuE<+aWA!FO z{H%wyN2T+;W_f}RG4a?RAV_82T4T=J8l`;ZoE{zV`zH4q$PbjoFnKpm&u27u zyn2+5{q)51W2SohOXujMfA!H+;m*Owi0DeIh*!Brqahl;&s2JzEH4({wQrkf&hYrK zanhv3jEP>=zU|3Hy=9)pSQL7htnm8Yx^n5#Fqg-UODUKBNLO^if)Ur{Xs<@Faj9VJQ+wkt=Toku<+-FP*AmO(&Zup(#^ObJ+JY$TEZs?I7A;t({k27u)EAh5#o z{f-WQR3EE*EJ;W>8|1omoVaSXtd}YLqcSRIq^?>qlYObT>y3!~u1da$_40cGFo1qO z0Zv8j{lNHCzTkS;!I+d_ApLUHK6ciT=4Uy|079uBcMT+thpSq{S<4i+uIp-zb=frctVi z@pyV(T?^e_z$07Hn^$d{JZh}paM0Jm^@Foq_o|4;haI=ad%va^_98tU0X@3;J@+>z z1pQYc+~in6(0vjGjnD_*$Zt#1_aW-y<@IN}Dtqrf7vRwDR68SV7IR)swI4LP_URzHXU?$%ek5wrB}X|-rkBQli(F$IcGoAM1!;bd&Sa| zAaF*nR+X*d5pg;W@^NJ1o3gPzy7_$JK%AltYRMPA7toe!8(sXbRH);X&%lM ze$14nA0InP^n{+KLjY_n)~$CEt#O|Dy*U5yoFrXHK6*`2wjIo|4zIRyIJ^xt($*!= z9b4wEH=-W9lZFX;9qsZpz(Uj`zM+=psi=FENa0vw^+JCF<6x3RR1m=lwiX2TtRj%R zfQv)Gd@W#P#GtLU)up`KqCM6GqJHJTo*Hy)Jg@+Kro?5cbr6gmyAF@>%goofw1|J7 zzp%q!*dZ$JZ^fCrEygx3+T`ntns7Y0xZRSRN|jSop?&MygJa|A4K@JW2-{F>YDO+G zK3nxB?O{9>^$oh%#JCFHqdri`7l8mdaMuL%jmvj87lz>)TWK%`>@PYwIaw#nQs6%` z2}^bWKR}WJb}^#tPVL4AngVi+_(8AWgWYV5*FwIIa@T!A8R%JnU?fVCo(jV4Lc4)PT3MH)gn;Y7|gF6v%%6!L>G5vQalz;^Y8TL zYV+h8RJ+JEIA8hWT9Bk4BlHw_q2YJA2>n`{T7F@8chJ&@t0&iyLWes0PQPIxx9Qh$ zqCVB3T=kV2H*OSXzG}MRs1)W$>CMwMz904;Jw9@lz2lXuwLEV8;qK>vQt80&g_g;s}3dMKBg+BnOH&x>`pQy-P&1~gx0o-xy3J>EJWW;*2wm$sKI(E@g{=Hh#KC;@w#YkAJS5(tu>{ZeXA zT%2+87gk%#wJY&5+>~FnJBk3%tfBabbNs{rNHfPYJ_h!6)yERZsTy)_p|u2?w+t#; z48RhYO?!VEOe~K)1Yd|rW}-_d`Dmxa(S6o-c~3=5M^UZE*qlsx3+Pu@IF2_^vliA` zxVu**IlUv~uCW6J^_>=Z19gg|hsB`&+B$cprd&wUEm-I?kN7xj5KE z@yQ*dslCM8f8Mdq#AD1W6KgPrc%DVjlZXVo6S1efU|#q!@^j|5Q*hV?A~sgsy%zd* zn6v-}vWmdZA>*tZC?F{7k0`2-0*$W)J$cs;}DJ4vwOwx`ibemt?eFvK| zi|oalhRq;i&dIg5bP;9#6EU$Cb92$c-uL()Tec%u;q3l;zI~0Jgbmlr%d4-*M4DCR z0ho0Kvki+9M?25SquDjGzCHvFJUd41izQhX0KnDz@AehOu@WPTX<^(YkjMTZsnLE| 
zP;~7h(`Xm4=ZgTGswu`rPQ-U}LE6aj2Ti?M)myYvO#-r^Vtb3F+$Bm4mra$};(^ia zbVxE*w;N=WYJH#3&N9gj)^J5TAD9B)2f+w7_#W@;H?yHC%`K$uDhPXeByxasfyOB4o|4;4EyQ!^Y0aTi7eWy z9zN|MStSg}J3~9tD2kEIxq2^a@WV6lA9l*um~*O$TP&gn^Ak{sKHOid8%Ed?6_{~r zE`HqG(r>Gpj<3@1+qLdVBrx7y@0#rNxWN4Y>>^YnzUQ%kD*9D`Lu!Yt6h7UD4Y|F# z13F11AU-t^22`T(xsmEpM4)O!0~t%wP`W>47l$QP(tL6j#uhVQeBi_0P; zn3vNC*bL<9mga@pyk4vrhUn3l_W~Eda*0XqL)L=8O{PZn0iR)EsC=P-c-PD8_WeC~ z57EVoe*0h4)YR6}yEQnFs&-^u%tdTB3II592PbZ&oF`tpnkGHM>9I%6=knl z)cmWyqSu-YKTQ=y5AKP};mbaQQX2d?E8}(HS)vY!j*s{q@yL3kx929iJLaEWY|Qng z0YgIOndJ2_Yoi=iQ@vj9=MU-5@^ZbdjA@plcR^4*L-XIpoL7GsJ^#w1%BImS%ROT{)+1wH(Sd4cUueZBszb|doJPNX34l<dIo~=B-iA6NBKfF8gYU37{Ucdn2vm zbK=mucA-xrB|VnP`-fgXp_>P2NNAmg$dgfj^24)HPopRU=PQuN_`O?+<>c1N_&6=d zcq&n&(A*g|F8)&Vs)U#8Ickk1wqMtDmRZ-TaZ&SZ8AgT1Ouy1k@jO0}=;W z!|!V7N!)PRIk|K9#+Uo{atx{5ECUnYgbAwNeJJACNc&y)0xX%RhmxJPg@b*%r3S2B zRzslzww#vh7P)?_Z%Gqw9%-9Br)@hTCuNE^IL?4(>Gv(wBZ!K9g9UA|0HBbXNT!H) zj=X{#U??FC{`h*es@0i_6EDv zXsw4wA>I%SorATtwL?4TGtb;=9rfK?x^1xjQ1BVgHzZps%}lhugR^zwntsQbUKy#0 z6qyD{0~>_UZIUE9svD#>YBrQ*dp^}!5Uw=Iy!n$05Ly>0nbof;iX4~Non=x~R_5U~ zzWvaDobYnF`i*!Qe(G1#(hMDUFO!0tSg1y6C}sW{DkeT{?h4!i{J3OaN=4kuHr6}C zawjH?ax%3PGaEi{C8jfwum@}g;p>leqe!WH`W!~Af?A$eUK8K{ENDMy`l_dU@go3` z*G`;1iD_nh-e3Rj@mvpBY&ZFTv^L!CMeK7zA3wsBKQApxwX~R@3ZcJt20v1X08zpfR>BmTdel=5|7_b82P8DzU(2}dC8nzJDX5? 
z{hE8(rO6Ad6U|_>JlFo>_}SpbFhqTJS@7Bqp4KJ7|K)?W~_KHS0!Fn+#9*0MZkO= zC&XHxn$rm>!D!x?Z)C5%wPf4YpI02^GL?5@XAO(pE}QJMGf3yXWq2ZE#x6kbv`w&5 zM;`IxEORyZ9e*x3|86Fqtwn2|#{H5c=MU$&>j7=14AIh?cH7Z#4R|M073j-$T0@{S zTxcSAKZd!NjYT8EDpxDNMEZunQvccwDo>% z%Dq{5MrV9$DJj@})E}qfYwgM+PiCSilg|dhP~*gzUa`Q#Lhz;uIy25Ac)8{(67i%{ z%@7(#i(p-Kt|aP~s51KW(Rz%SI5zB{!>;rsaAJ@A=aSd^kYuN2Gru4HcoLSCdz*S} z){y?G!91b_ZncW`1nrEAHHW#6emc@?J!(!@B?@^rf%27~@FynR-&wBR@_9}dt(F;% z0#`8RiMg73{)hL(`hz42F=iff6EzOE*!hszNy-gd@#UiHO6Nh_dPf|j3O^Ri_jl!J z7T6g#tlp5BWkky&?slIEjEyeS?|R2TLK9Fq=G`W1mYt3kvE6;EaSe$iiq7ID!#4Pp zfA{qs8o{TzHS)L&xM``nZu=u3Nt` zBZBxhnJKqx+Nt4lf4>b=OT zc(6k5(8MpXbh1Gv-k%1~?qK@7==b-;OoC-V{5&hjnP#e&XT#uSN+Zz2=q`Lk1(#%% zv1hAshA!6yqe-fZP1DdG^XXdloo8FvtZW?2k1;5T@hj|1wpY~7FQJQAGqg?~_j7bR z+$;<(y7S^YI4#izdmT0mDyF!)y1MRQ4e3L|m4T|CM669pHjcjy8bVc)_Xxp(?&d3~ z{?I+&|9oENBBy~><~0X9a2KBqTwkzkzmls<&dX&gbME7F1Vvo*_TGy7s?UmSd%awk zb#dOK4DJV^BH8Ohw$U!LuSxH|lOzeY#7%%QRs!3jmSjEd4x={0j`il%jfXbo>{?|C zGbxUH+p{JLo2Qyfd^Sr)P>M`saeY4D<<4)f_zgdO8?-cgcE8pw!aCUcwcWLr6M&#h zx&WRNc*$uZgjM8+O=qGoYRFUaOs=gZWa=7B;xv_|R=zln?WV$dd)0DHu*ZhaUXQZ< zKvPJC*&tu8&g;N~YPa0_&$X3FhB`TVg!RTCOBWpzdG{Q#xXwFLq0_jBtVJN1Hw=fP z``4>o1CJw^kaGs2*X&Eyo?LX;bDJ}@&P*g*V`St)H|KC60u&B7TM z5lCNT{o#~;Pv-Mi`C1+Bb^~{&7BHnQnB3tC$HsEn1VLTzTK8z@qXVg@8;#N>P+Fh@ zAV#+nXHL!4y*tye@T0}PHBoqL?C$fyZ{@a6#KsyOyJksZM}lGI*XWM;!Jg6DTXxBd@{yb#2i|QL z1lUdQ@{u6+Z{FEqiP0I$(X=pc(~H9%$nPIVS2`rwAN{z+q25r3foP)IjYV-TaCM>< zPMsy$a=B|rWPMO(T>&6t+XeBF?#}C1?T*eJ!wr$VGd$5;u@-M@P}zC>1aU=~MaM{q zOKFMS6s5M1)KY5P)qI0zG*Ud*Y8{P-`8*tC4pTf$*UZlInFP8ut|`>6m$U_LfwqYE zk#g_B{`$)^Z@uJ|0Eq~>mV%O|*T97&_=L^bFYh0pp|!z(;w_#b>mU%JZpH6uY^+Q z{Nnap+~#~scsaYafyv(XJmt&v2#Ow4L!wW6FA~3`KS3G0BIOpQ?+`Ha;T+Vuk6yRM z4Bd?zJlNIkOmNhjzS^ZT^obS!8U0IBy}lYZ_s%4b`1sS;zo5c=s;H3bj5?$}Y4XC( zEjN`EN9xVZNrK@Z9rqoEf>O&hR^FthNBc7p`SPUJx+(Y9La@4}eWjMl{iLA#iI@oU z7ulV&5I*4XD}H+8B<_=%csIX&lT43gJ5l%LvQGzlE2`8}p$!j~A{`rweV2yy_oBI^ zX}wn_)z!K7%KHsk6HZ3lC6Uv>^|I|Bs#s|#4ON&ql_eZoYx635+GPGmi+%_1-5Yuh 
zbK4Ym0buQCfOO%XqP2O|Da>j`c3UaoTG#x>%Eb-g zW?G^3Naeis^yhjI$HlS^4t>AK_HAUP8*pJX;d?6!{lg4#aa#Ek>1Ce z$}+}$sfYtU69T8HN5pOZ`)wt8`C`ZkAA)Svd>~o~<_8n!fP!Q>s8_W*G#K5stqe47=9-EQX9yVn}w4NJqr&+rged739H+?Vn!nbYKRw94~dOLKWlDKwQ8 zd4%IkMEG}C_=--R%2jD>z{Fo+$YAkQ&BjIzGS*Lu3iXx0s3ZOqY;EF5dz_gtVBBY` zOk&&%q!U*IQFK0bpL5+!hcZ+2k_h@(4(@*&g)q#n+SA zAbuZ^zZSr~9d+#zW9azw7S8rL^5C*q_xb$F_?q1Zd@dO%&8(sp+*cC}QcLseouxOQ z&{7>vEc6!V`5qXU(O6&jHfi?oCsWZuW{O?2c}z9qKKqA@q1n>oYl=MOa>3Tc1&=^D zz4f-MT?vVbeObpvUn@=S`rbUHxP9|nuTLdf`F1nY9qy}(gIbrkxU85LZiSa20aJ0? z-6YygymMwGIboqyqE5i0fNTF7zW!dzMfZ;Z_(SIa_32OaYrshS{Cejx+g9v|#9`5t z$M_Om1*$3ao7BLWtNw*x>+RTxI-S5hymZB9%dQrbJ{KXizmh@vF=7cw59)>*G}g&d zOp}t}&%L2Djdyrc>A+R&m)oCT8VR;K6#$g!rVIu6!PX<3F`mOhyk`jPy!Y(;t4rSj zyW`u^9pX-lW9!p?XGe-t@=B|5hjhTQre0$w!)iMkR-Ud`P+)lGbKp7hMO5@WZf6SOiN$6s`=zkiz?`{BKy z#Ny5U9XQYZd>*%86KHb6wtCf-DTh1OQ87k*0v9}IUq3PEjaQa{ zeQmK)UAeqOGqW^4~mqBplld`mu;u?|vlr%jBE?9TD$Q1EWx2`;+kLQ;_++%TIRDj9>B*PbK_Xsgq z$Quh&0yU)@vWv-Eqb?9ZoWoy|x#xPN{V($ig+dS2zJH>>xv?=@K*GyabN}4vHH6?( z8uNt|-yxKxrd?kh$r8bwA%ARm z|I4ikC@{#OpC4~4q;k-wFHO~5#y`qrckuOiTh+cxO8&j^WqR~d0_iOZh*>_%J&sEP zW**!lcJu8!7;R%puW&>ewUOuBdU4x6P5{!9+4Pox5LpRc><~yxoZGlBOPe)hKR=t! zgS}wH1LQ)j*WBMg^d&1>J>M(+CO4)OSSVlgo19y8`<5!n zTBgb>XgqJxStfGTqA6)b$Y2)$3k5&o)nmBM(vlI@WIf_gM8foeR{RWjN2b&UP5}!) 
zg@L!`1FR;4^gxhDXpmVr0OZ{s?#eCb&@kYDcc$86JQzm-*4Ie;Aw&p6k^CmQOT07> zcE=F%n(ijSe!_5xeS+a|V`GPL-GI$M)#0jHw)avvH(1h)tyY># zd!yxiVPAZWH=T16sn5^7-w}l*Nv&KhuE;wc5dNRJn0}cbV;t#J0r!Gfr{?bLIm<-S z&CRPoiBFQuXy*T+Q8U9d*)kkkIaX3fE1GDfCC<`(FnU!sKU-Bw$9=>(TFmN&qV#JE zJsoCYVM0RBqb;T{9i{xMT#ZM*ZTAnoAfE5vl_)QL2Y}+}TGK}uuVrCHLQmr7_{VZ` zvkK7Xf=;Na3(yI+TZ%KHckHB{@Ns>auQ-edv7)`u#Pv^T-D7E8+j-rA+cpd;MfQ+a z&Dp&PXM7jR6H>Q5h)I`=Tj)wn-aj6EL~Ud>ljs`W5zG5tHW?Xv?}1Ko(@5nY}*&%xnQ#?Cx`~&(+oj{6g#tjd zsRv=6rl~d3o(Y#>Cy`Eqthl#*u(n5DapZ#Pb3@iKxBWo2PuA);yrW#ZhiehM#E_SHexh+0K} zJ4>%Ths|w0ue1{i0z?ZtBS{?qQ7wh-*nM!=DZ27K%t2%e8!-Zj`kG?BGAq!>-5laD zRvTQfo9G5!~2d$RL#O90ZeAxOHCG4!Am|J*Yv^3%7ct6P}Gv}Xh&WfH9I=oZd?M;vVvVR zCzkpZTCNi4nZiGm(gxWw6DPXm>w_!a5Z-&*eS_sATl0FM-$5>DMYCiK*~{0tanjvv9k-=am2XJNcz1oAN#ibodO)Gx;UA8xvD{a z+dnFtf1Zy)M9s&k)mHd8X;vc>3P&wM=zSz;y^?o)gkeZunurLG29%j%xGG-UeyAhOeIp=Qj3n(!nkRa0~UvN+dtI|^%N&X zg4jd31OzaIHVPcgfc(yqeXQBB$aUWu{K1BF!=I6NX;OHJfSsh4s`(jKAAEwF4Ey!wif#l`)*^(mWT zt1BX@Uyp8$tvnK2?WyKg)0~Lg2iq5;;7Z#n>Tzc1+L?!vQV`rQAO~~qGlGg!=T3}~ zcQ!D1Xe?FTp<2vv>nn~WpqII-nvvCFBNY92Bu^975mbV)SgU74)*Q-*u~iSX1inAW zo7(I=vCf0Y+CVh7;k-@BC2!h56I#*swY3o% zqEHLG<`PK~k;*H)62^A=wS6v>c$LFyHoZ;~bF~dF;`>hhD{*mH4|A%*KHnR{A{{g{ z2M|8*xVYy}O%hCmtMD!A82Wa?e5Y=dypw*F(`zhycbSBT3^@)v$fZ9+(mD<6{t4I* z^%A-Qv8x1=t33>9Q-%}?O+qMlkh(5(2LlCK&z87TEd=_v0!*lMND;wEo{K?Zyx8i~6=LYh5 zvdT9S^i(e^tY-u{T8#+N;6~8X;~>9S+bF0MxG(K9ygHH7f4!}zi);W_3D-bsc+%0l zb#b%ip=jKP7`)ExaZ0Y3zA|g(=nrQ}27-W>Kn*#7BxC`Q-dHPFBGpVWe;-S12J52? 
z@ZGLwTvAE`oy-|n+kV+G^zs6*e7TJ3JPUUO4rQqg%^(#3SdAVvy=t-*)z?0{OmT{d zjUGIJaKW;pfK6}F&iaf#6pohGBfr3Yv zb|3r4)Sy-{+zkUc#`hu|?{jpLs$HKnO>iQ^@>vY9WS)hk_g(})BmEy1z`djc@@Qdh zDoA`ZGGUtW3q+jFpImSWUg-0JeCU-|yTV|sW5d#H7=BUFVbt@Spe)8=5pX%-ck8UD z>H;MLNmOcLAbb*MSku%MnxiGK=O|4fi($irok093b@QCRh-)+-&Dh(mEHTOsMCQVt zdd)iqn=klM9z_)#NOTGPZ%X2K!pox^fF@w6U;P&|*0qf?!*!EH zrZM*?g516r65dqe#=WiKbKo|-I^tU38!>+twOnAXp(B>A!_RqH$3q_P!>bAT{Ic&g zg1YuDBsvk1=H)$CBkvaH@o{4bFU^AAn12xvQtuIZW+z*pruiuXP^~ zvxX}t=B2QW#vHDhI{`BRj`sRx)|QI)WPAO_!yy;&OSrS;~0Ruc>tX8i(z3S2f@r^Fhu)nUHPt3;LK7ZT^xPgcpwX@%qZ9odtK(7*vy< zyk#Kag!}c5yz_1+&F6{Qa_d?)EgqKONGYE=Ld| zsLpQp>8?5k+|V^?q5huk*AgK6q5j6CNdJR;1Nr_=nT9MK{cds<4ZSw`{+f?vb1{vH z8`qD0C5H3pZ~(s!KshXlu?Y|7fm27rig=r0MSD)UjrklNqvT1np|$GH(n4wJw$9zL ztI2e`ToL!l#mXDXRZFs!2Vv8zxiaSKLZ^TnS)2c?E&mGHFA86qH z^%7t@pcIes=HJ`7>hld7!{hstae0{+hX=P^5ty(6*7`pC2C0``g2NG=U|XH5wG=PG z!y_@#UOydo2sf?n&e}3hPMDcyj!tOJ&zzfjZxRE1qdZlynGIMfg}=We)-kos zz!Im2^W|+?pef#ay4U~F+&7Kbrqo{hw0O=`k8efX_Q^k*`!n{KY5enRtRaTr?oD)9 z&i(BJzyB%$$He;jGIOD0u=S_f?Ys;B?9VSIV{(Su8D1{y%H&7CS$coY5e(mN=4fQz z7{nD9I!hgXjriZ%Ir+~= zZjiwvAMKwl$TFatH@Ck1b1vzNVR%@!4nIuJlk`kD_phG&$27cPhUtE3mOg0}7i{fZ zrd#!!T>IDRC6N<^mG`M)^3a&aH2!FtfcT#yC=dG#CULBi=$6v@Qr}l#&ir1Jf4v7x znB%@oFw&K@Z!i9%H^&fb&{k*OpU+@zjk_cKe!oDR<%-|o{RF7K0M|uZCfx+ppRW$V zQdUVl1RDAQk{aK8vi)o61;moZLU|e>*>zxLBB2b=)4{p`@&l+QR)$iH8G_^#L-zHn zY=6F{7$%>}XS!YII;@idl0O?3$aux%9ReyaWg-)WzJQ)1Rj+V__$<~&#{zE&-9`zb?a$U z=w_SbEH3|mAYf}!@v~!tM%W$$S5O<=`q4KMK%eY`8&xfUf!7`ief$CHx9%(bo}OTV z0B{3`!A@JtQf@<&M9h$EX_$cM@m|rC3A*swA0BfpJNBF_z+we!dKQ zv$ZsAKT_q4b)`i&K1&;B2LY2&7Yqyg9yAgwxj2A-@%|mK3d*3S#+5XuzI=U4NDL|l z(_{$JWu~U1FEgiz+?&1QZLD;HNL$HvumlyhXp6~tCAQhCt}0EH>y=h1l%&$mlKF`{yEe+c_NQo&88!JRwstRcng`jZQQrQNmAdgws@N#Hs5g6G!4 zb9(x4-}mtN4ccTrF{;#@D<%`~0-jy!b_w$vQvqf9`;&_*fhpqo^)Z#rR%q51gX?qGu;_ zbUMf>XU_4P!A50Znne6s#F8##KY>_zH^#t4K0DMpQ~~G!^XMcXMKL%wL5w@^#-o!A zDJ_J*hPV!aW!mYNT09fmUww}?mxr)uQRCrmZoWZvH;@XxrZt}t|9J-z>tvg14I4MQ 
z72Sb8kT8$N+N3^qfeLi{n}vREq&AR=RzbvFMQ?JA?EdM8!VqPfV3yDu!-9HT*8B`JcGqI zux z2M*Tq`@n8J0GX1EXOEyZUwPyH6i*-2s!kONp3AOZd?9^i@c005^OX0XlGP$`Rk+DA z#Ea?xtm(w3yuZ~RrgUD|C1YI{O>-HDKu-_Gj}B6>{->btzUQ`wYB$<>j74g6iYQA^A(-o;LsW6_zd-Zh+!uXfj9kc4qofmrRzwNhy}Xo@9n!m05E7&(?6;}0qomj-~9PVTUx zL-NO-m3jW=*|jk%@WfqiL?gJmdJRkASo)ISx&6_5wHNB&+~~Pp|Ea&w*pZS$tE{)T zH)~H_kNhDMS_}?JK@c&XshKY@^4*zI)xDWF9TK{JiiQu)xfc--nf>D$p)V%j1ePtK zdd)fQh`VCthQ+3ju#vHXkBy-h+cvHdnkB!P7C63Yg%G^-v9r^Cw4FkeN~cUWa1h|n z@AI|y!IN=+gM(ky!&ASZrp`OHJ?=LyoA`;fD@yRSmw9sR7C5A$I7`5>abryE*4I$% zIInCg3>QG6(B%MRAe6RSiqSdWV)LstIfxCT?lNojVTj(A+Yk4-^%?-@i3{>Rd?9m@ z)o2bUf-#9P=}_3bm4Y=byEAj_Do^bq-?kBzbuIX^qdtY*IvAE(Na`;*Oc*=byfPom z6kY_(ZvC4#Zyxr7+-J4Vcyy``6mswgAM+y?+fey~3M#y-TZfBxDKbD8 z1V;LbfQCv`-B{Ofxou^Onlx$8$>`h{{%D`*wW+pReD_Cg$NL|IEpq{>z)yHguTca| zwZZf``(-=%{2|+r(U)yt7AzadJ-e(?Stvv<&D;0B9drP(MBW)Xc5EU*@0@VMSvZ&s z=>FKNkf6Z&IbZ);0$kEzC)7`(EwO0Q=;_QR9xTCh5k}I(G@+Z#?XFg}i@9wN`kDM{ zYkPYc$MasobXkw-sRgIHqhcXP8#@y?8X#rCW zj+@gV*1N~j&&y!kBacMTcOLpTi5wVC&P)4?B(d!l6nQ(z`w|x2|Kqg0{Lo(}{aq6U zq4NAlHs5hRw@?i;H)bXb)W$M-etqpDozYx<9+a!Oo%ixHst}+ReUQN(bzakcw8dgY zso5zR@r~W@Gcevg!qci*>Ojpb{iChqjn`w10) z1%x-DFY~a+$0mEbct;z<{Q-yh4|EXs89^0N(CuRF_bCW z%&L68*O-d`i0sQry+Wh<#XfzR?ujBZbGcX&2dOO-A4F8y8rFONPkSZt@bS;9-gkzy46rXmaq)(=Bx!=u0OK^9^MsM)SQ@B%VihkckxZ}ptnAgl0I921zZl6#a`q>+7!q7*H@0KXSNVL z`M2I-^9r||8ija~`;k?Dfg;}q8-fYSrv>BXOs0*Y1HKXFDu`+9;LD$9jIR}1dlimf`q@I_gRI*9Ey$jA;gt(49yGo5sXabV~NH94mmwqU?2~&QfwT^T`2VbgB z$%(q*jE+waJ>)>F;&x80a1%}Ql^kt7&V0eF*(Lv9fyP@Pg8j4D3rY5GZw0V$sf@>06_$TVWShTlhJa>|*^;F-u%a*nIzm2_7_F2ZVVBPl?S&dii9hP}Z-w zvG=bi9BGy%(LMH#@W&>68-6wqE$i{VhnyFePea>u&z)Ujr8)<9N5r>}mw(?LRV=9| zaLC@?ljiI>U3>DR5Bf$^D7*T}%;YR~yh#XUjxAVdD|>9ld)&>;d=J*KTQ+IU+dlPq z*!H8SD_Q5W`p+Si!`yyT8hX@&YnQAY?vi3P`;CG>{!PChZuu4Jkncf4YR z{aBB6$5CbDL5Em&0g!Od>nVR1XmqgARF-wa{AUvoct+gyey+Z2meFB5n@%Cqp_wS( z8(Ee4kz3XiFMe#2a*4~iATT$OSwBJm;(kpjl)@6Q?cH0Y$a zm2!}|W%`5Q?@e_pi>WgwsXg0#p}Ln(al1H%vYblHQ{#a_P5iB8TJ*@>F_pfQ)T zCLNfhcHO*&EqWkxhb8ON$-OtEl?lOoTX`T}R!H$q 
zPMGnvV!0ekg;nx|O@j9qc`~iKZS5}N`f)aP1FG1U?7TPU(nBE&TvC0bQh|@cT!HsK zSoelZII`EXof<@(6!IW4RK+%=aQflKXXK`qKzP&Dkn1%wDl%TGNtIz_J8vAOEo|?MBCL~DvhL{=wmDW656GDF+;Bwb-186U?WU)Q_ ztSF^XlbV)T-`N}bP-`?_QXO9j^GyxM8zmIo+bOnfuuX@aU3xQw!fR14rR6D66sX*| zqC}je(XWyHiY$-0buJ(h=;cf3ky3GIwSrFZ8yW zY#2wg`rU;K&q&Ws72IX1fJd9y2@SnpWX^iV*(8K$s%k~WP74UWkYg=Yx$!#ft45(F zaSyXhHWQkzGbCwW-I_>tctOhKEk$eEG|$|11w)KmT7BemG6t4j`9%1Wr2c`bMqI11 zRblze8pa1ucsH`iGUCv~9%)%OF>SrPCgtLu>k5<2rr2Cf7-j= zIFTz4)Ni~{inz%~C>Q!!EypNTo>?s_M#(gU@ijBwt0rax`8Pw_{N;@G6vE4xpC=O) z-$gUlYL451UJtju%3M&Wg&KvnGy0N5j$%&WzIQCkH>i zocz=|hSY9b%GZ}(t0VE7^i}WMoTZ;$t4y5}iSoS6tFk%D9Lru~+(C=0Zn#~3$%sc7 z{GpenlS;p??O#|E_6C=`gG(k9^*F0c7=`PN{>Xt+XBSXTQkT1Zpmgi>I9(>X#evVG>S^0Vr#X%Ek1PDH*y_sU^TYnoNIUYA=`k4%0L zmzJs_q4_S621h=2JuNd-M4Qw>FGxn#`GwOZ6L|qWF$QHnZ3RJ8GiM@p-g&tf~@OrAv1Rx!A_{&OgR(t#u>_c4FSevppapc+uY6U#P&cP z*GMQrfoz7EF<-azgRGAHSGJr2-LJbEDH=0#7cd-}qdkGssH-RXoH#v}?39?;-`9DFZTx+!QW8yE!wcqT96h z15SvBpr8MY4Fq4sQWp47Mmg@W4_hWUX-wm%BeOE(#0Y(w;}GmTcvpvF>n%>d?Q>n< zS2YyGs1H{;-!10OF&YI_Aie(S!;2SrxxQ7#;{B1<{?`Hm_yw}7Mr$Z((F8^9?d@u$ zJ^a~fUiJ4{3mbM@Q7!CAYNuIli!Sloc4vg_g=)@c^yjO)=6c<@j7Af|9w_%aSTHUS ze#|&Tk(8yOCE)dW+bMnF#t+tr=dwzQ-$%{uf8G0kL)dXrNMC$Tk9VE76E=DG@PRjV z{>QQhw+&9>Up`$1@P+$3Le{E+q*w`TXaG{geoGqswY}+Iyg+EE96_OW_eYyS*{ram z;{^>!&2l0tK=`Ns4et6yfP#;f2DL!f^EkJ{i4fMsVwrAk7q zgYbzfuv|EZ(VK($tD*8#=yC-tQ`5GD!T)~AFfwix|Cs84?+`kJsLk?>CKGzc%VYKQf1lO# zFR)wr=b=CB%|(AaKn(&9TJ!q;$N&D3->n7#jV$jslH{So@|q513o(g=v>+PFhMZbr~bE90zu>x_aWBRQK=tv_wV2SJ#Qp&9|ZolQTTr_6Cf5;OvtPmV^Y}j}S5h1r!F}R?ZUlV@!N~RW zCj$9ib`eP;L2!qB2u;)!JXni~48EYq#RZESC7bN^-?udwFh!Fojc-)XKKMdntG8ZN zAH1ljDt=x!F?Jd<&BZjV)6ODHu|gqKQE|TmA%7lO*x3nSB&1*sr36YSTe1SN zW<2YO03B>!@XV+Ge!Bgh$W%GpW|TJ-D>`E0ldL8Uv|k}e#Py%oHGL624i9EMda|%o zKi(kaNKe%?p1R}aBo23m4wl>_;5_|zF_hnLzr%x@6L(1piA3}&y(vv=;#a=U)EpVc z)(77)MJDe2%Qy{$`Nzy>Q}-%KBJs>*L#&NLXQ-FP11=#<6`;%Ls<_|LoBv;z(+E6I zpS`|+KI9f_TK`IPE^nfi~cDmqMo?;e&$M-~XsDu%LBD_g_xaGaiR2eqC5 z-U}dUSEpxXlw&Ey+mV}(j+FoLoss=C_U%$h!vFI2%TfNU$=Me1i1I*FJ(KLz)Q6Is 
zU{N%`6m1|w$oxHc3|SI_+-o)}faFb~C-o>$|%A57rVndqDWhNw-|F^zJ`}bQ(ro|!nyXrn5z6MEJ z!|}nvwbl*<{eO&W1fou(D-b^`JU$H&ktizgG`bnP>i(tQ6pC+bWk@_p*pdzSQLt6= z%RKyd{y*N{Dz3_{iyKv3Du_V{C?F*vQX(DFDBU0pf`GJ=B8`EdbV^HiFFH&*q#KlO zM5ODC$G!Kv*ZY0%xj7f-PWdV8S#!>3%n|?ipG*INvMbr<6Wg<#AyQxQ+~~_ zmfC)gZ=z=rU03-Dc4VEJ6~4m_E_t@|VLi56smB*iO6xQE7+(R9(9rqq1?HCibTVBk zqgWoKVa=ys^}$Q)ch9h$v?Nc5#d*`B^&LlJt}b7z4!R}RYREkwt-CV3+n9Oat&fN}LSVmz`DWr7W68{bHgPrMO_nQ5-D$W! z+P9K;&j8Vs(wocD>ToJ(q@PaK3dCGG>Pz=FS+(rZ5efum_2mfr-|L z5!LfLwN>9XXDAEqgS$}UQ{$)}au*lZ1M8(uVV%`i7s^b2h{+19OSC@yDT*tZxJC&Y_THZW$oQ$9XcQdkKOZj$%V zwtu6P!(pGEsP~m85joP$cfbFofcNL{h)fbxO#Wh82~%|LyH~vz%F3F@p(d_|8*!x( zMki^KR5{{jWwW$;?COulZ&y1qvG9n+-1iy&CFIUrtIpIqKOvlI&&Rt9XPIsyp~MJd zX|CY!>^7P>4aARG{}`x#!PzHSo%8uyq3QkM3=o&_zMRGFFs}qYr&^j*cDIoWN(z5x z<^>`1Va*2fX3}%F*m12v?E%GD)7aV0a$uuKoI5>5djqx|HKrN)*HrsYSWtR~Dzj#f z-CGn~>Yr^V5DXv}S?>dDn;lRJTLvf0h6+Y?9n2rWUdZ}~{RU8?UC8TZmTa)%> zsfeic)~NK>ubgU&tma2bO6SNr!r_DTQMxf^FVE$#i*Q;mZLRe0ydsmE5cb$J&rwae zJJ^#|ZMLXo)-}Xo73y~f0<_T@TMot*`5OPEWJ>$7;kyf@!((;&HB5P`8r`JkAFJ|i zpoK0Om}}LuwUYE!IjvYk={ac?ia$m$n%=Jon(B@{Wb>4-FES=<00Nn5f78Cslil>~ z%KbE%!6l{axqc!hX8Ygsg^fmDUjS4VS>f8+g7G#RNl=;94u#`vZR@+pT4FX*s6*>2K9-$M~0v z{K=Az0ckLA-UH#Kcx!5Bebu!hlEbskdT$aJVFQo8=N0yb%a88oY%wRsd6_2+4(3j> z{oTX=T&wy1o4KK>l*DTwO5p>R z`gxihw{B@=q8>YL+cH(#+CFZp;fD)EOvA}h`sYS$FBF1*nsv|qnohiV56x|g=d|?5 z?vCF78y)fI9O0e9ns6^^t8zRv&^BXE>GIc3Nbc=jx9|5Ys1m=Nwxx^`B5ypwpbD=I zJAI)|6W$1~O&p+=H0Zhd+^-$1RzGlkZJi#&;%<+58BL5I<0pV^q&>OH!i2f?1m1wm8ItW(EDlLIzJDM*jP4TC@<*Yy?CEEC zM3dM>Z~W#Z#%dNfoZ%YK93Mt&q3-*^%iE+Jt zj5iIB9n1f2*lX|u%~y!`kXkrz`ov$QdYkfoOkGbj)fFT1Q&7woE`P$!8$EeeGcU^E zBCbeBZ&u`3xoz*4QnL&Nv^QJr!>C02zzdZt+yYs~_i_>vY}U?`5l2ppHxu-PZ;|@l zJ&*t7wWu3_gWIDKb9(z67I|_e!7IHC)&=% z8P8l2XtM~sN*Jk{5FI&QQ)x2#SX|SXNN_hcO*YRO1ey#t7rE`5@N%>(i{rbq!c5e1 zOnWF@Xld*j+k0A~?)|^*!VA$PL)w^E{OO_`WWBCNS5hKZvn=Fpd9>v7+oCNf9u_Dh z%p=JWoWuGArnN+HE{zy3* zaOd3cdTzBWOSqDmg5N9C623tdd;{oG3$(HVM>{`3Swg>P*_Se~BD&wX?TYJzhVud$ 
z(uDi_k|aY7!pEhjC)~~0I^xW;Q_2h?ZgMXsxGtxjx$tQ?6f^^Jb!(piuWKV_JCOJ+7C{z^hkE_Sz>q2IhI*y!-te0uD16*0@FK!w(XTOU=9 zb%I!GRLCCMG z7ga&DMUZgtU6fgUftT^_Ii- zs?(2VhP9ayYEM~|cf0XED#-;~! zRH?fsf4bxUK|I1Nxf}q%mEs&694nv{jsE@nw|jhyQo%70#<_uyg0QLq)o6!rWbQ%}#$^)_0WXZCFs-;iFJ=`|K4?P#UKA&-1eq_dDu0p%$ z>ACgJ(Pa{ctS2WbzD5W8yqQ72Vr;r+?rZZOB62?VfFm~y)1tw64}PL*W=|1B_x7o3 zY58y^{m13erEx3XNk~pMK{X@YL-;l>+{d=1(2E` zIGIK!(iP~Hmk}xovd?c2u6!7;b%o);R4#C5odBgoJN!#iQ4~w3T)Y1QkEX@po<^Sfm8xJ%yF=&d=N~FW=Xm`>8y( zvh=n3g(w%Pk_7{?78dy6lp!)uy$X`4h-U%-U%xjg5csBKsV)QY(Pn?GzW&X(Kk!_@ zZP~|x-rsUoMnW-ydd_A-$XV?G<87x_Ua|Gu*HN33-TGLvRzJ}a!AKZo6hn%qiMEhP zr`wmT3EI4#{>1xFUjJ98<-pT*4$Hv)@OYi)Rb^}V$dQBUIXxxow4Bg$l07QYJPRFf zWo|CW(9ue&RaME~lnGt6AnMk;ITQ83q%z>ukK?EA{m$1KP0Q`e>^l>(?LsX? z0nb5)?=;;cxG*_W%iTJNiUbK>$-X?x3un$_SGVWZtn$C9+)&6Fs&yUAi$5Jj0cWGz zxLpn-a^s7LuyP$gvlxz)KqriairrOf9j$Aj@SLsZR3Clx(S1FuV;s>tquoQ`!e^bu z61#XncE=)48Bl|OqaBDFVJZkt;sg*s0vua`;K#bY=d?Py(C~BRTjj&%P`B9gnQsHI2~%fJVzu`bMXxJ9W17i(xS}hE0(`LoO)kg;hrm%&)FyU<|=oa zh0_9ohNW52U+kj)8jXmc@PcLKCtnV%^o)d#nkQsxM2^HA9^HuL-K{@!PG%uI`YtL+ zBtsz{0m_wvCIZ4oKzw}cS2LocuP~77*)`LNE=BJFDOfX6$bED+JTQ0nmp{@cfVbXq zF2Ob5-h8*A#${u)3hfJs+7X0u8*}=R@KFeGp}_^A93Zv>;LBhEQ?#*d1KnCW+KxN26Be)cT+W03@VnVp`ecJ1b!h@dfO4ozGX{3%lI)Si_X1R85@ zlESum**FXg3A5eNh+%`|y?Z1T15&kAhuyL)OO2HA94vc=_4DqPd~Z zkxa^S@r`3iQ!$J@G`L6K5&b(zp=!gK9@tbJxtVkKbbYQ2Iz*9PM|=3@y{OWTZkG@i z0Qh$BFER7H(=|wR-+E^aq=bTIbuu|I)ubt8;Dq%~;yo<-C($X@97V)!4be`LT<}e+IU37^vk&L>AF!hC@RRyZJgtB{I*B8{Al1@k529@}XKRkx{l#8+J z#rp1j-4H=4duTHkTr=XRl4iX@Zt>IgH@7@7{f7 zFXVo(-k7faz&mg$zq+Hn2k75q!S~{`!HjAIhQceLq*Mm9(rQN;w`!24SOyIDHHyrN2h9JaQ?hrh!RT_zkmJKxRKv5^&N$0m=%&HK*vDDmT9s*E!yPUFZ` zDFm|iC8{>j(f^QR%(4r~x}I@VYrdjim7U*n2W%c9U3D$*0N+HOF*DI)FRKGrde&Eqx{+TDky9g8`L{GNTD3Hg`BAanW*(X5)4N+n2 z4qcIYR$R7@698Tj3Cx7*d_=+s7~tQ#@6ky}JE%0u<5&<>PIHGfr7E4v&^43UK>QT9 zhhtIcmLTUA$Txchp#g1-Rj<~9I`HTQL7hIxe%DsLZ#higys-&dOG3rK;pZ)q_jxB_ z?f1NndLW_uexJI!*t>2+y0sn^F-b1@ARsO>J6g5qgK zMmadNl6%>$A~M4117ei{NY4V=V>=)R7a#J#k>9y2W_|sueU^Hj1+a!L$|%0gSNSRz zKZ 
zOj02b#8GMb42Nr_ShZLlS9VD)F~^c*pwa`c2~;(!GssScN)S&38_TtybQpt%?5R2> z$4o0=8RZn*d!KX~DR|Qi# z@{l&dZ(C2&Uz%2%hrR3e+SFTUA1TwC<(U|athCR&8^xMGB+~Hhy#`uOAZMi9_Py`S zr(Xys_8{So00yxW8|!3cFKu@7$tg1L%e_asQ9W53v6L$I8A&>3co$28#c#4GuV%`} za`T;(5)1a!1@1%#K2e)}Xvs2wpGQCAZ$kLwXNc-%N3$x9hQuyHj%LT0vG+?+wSQ0D zOfH`p=Me!G9X7W`Z&R8!F7lv-{E=+73P$ zXAN6Gj`}2C6}N;H1Pt~T!4*&5yhuM&)Tcq|S&kU17+3I5i@M!{&x9M{EwKp+t1qM{ z@YtXOoR(Y>^p zj=Yt=OFLTO!DQCIzdYQx(F$ae%D-9}ml>dO+5clTHwQ;gn)qTH9dDM$JCze4+n_V2 zqNeI|aPFtfAn$6$ zlx;guB)Uc9N%xZU_3eY{|#BpEb)gCRm`TA-n+`8|P`EO~bZy#od?!-wP zX4Q3nMMFRU*m=mQEa-|rI- zGo*Kp@@xvpIFx;?)~ML05f+^`Y-HD!wUjpJwUH-~xgPN=OCgq^sD_WOxLH0UxJOYe zRm>bTf)R$JWEjj!nP(31QAw_J(^5 z0zQ@D@h*Q{&9%(WskSniDiU%<9CMFR!BElEq`H?f4kR%=9>l>o7uTg1n4aYp&VmKVRVl>eE{<39ddj8Ts~w z?yRI{k?3nQtuK+v+S^VN?mgXWJh6ZN`qy{HR)?NI4Zj6%m)K^wCdprG@!>ps!2Of| zuFGF_`QIh_IcPrT1t(SfFYH41_2AlPj;jw{4{A*1SErN%NM|aOG8+Qq4Nxc_1mp-7q`YzFo(*7B#GwSt)<*Gp*i`dsSR`c!MM`qSvRy zsNnH#h+`3S6A%*G$6g{IVC@JLQ;I_#yL-6iPgiE}Vzp zqnNh6wr2Cz->-;teVy2S@WE=%hv3W^HY?~NPFU9olkjRDkvueesM)a58Cz%R?&y17 zREgBBs;@=rX(J5E~J0q>%hbuBpQZ#1&emRLf7-e z_g~PQ3JKoB6&U`O8EOTqn_&QXd8U3Uo;)af>3Xtd?iQm*`=6U=Z3uHs3p z^|da>xjXr(3RYjZYE++|rnMw~u_*E^#iO&_77x!tNAw=2rKzK3|M{Jc;z%~d8aXxyEh^H+`_s(m^>KZNo2iol>%R>HE6rekmWs`=SoHZ4ML#*(`;+zgLKZ}hXu z$Q7oheySc7yR79GNq2H3AaM-UOZpEMU~Y)4zGmtR=U2zZ-^ELTX~9qZKG<=f{H3W5 zTbS}1*c&z5cXo^dgY8GsK)^83ZKGY^SPk3@FM@u5iT%{%^5n&9xuE4n7t^?H>SwMF zRo&v&3Rxlh)f;v&f0yjgDkj9(XFQtkBZePzG?aVzafdIqvBCiKfNa%d?$xOEHbQjw{$FTnX8C27zh8aFe>KtAyYZU3e2%9nL=;@ zn2@J~5+dR#m_^-rCQ&EbZvQ^OU~iIwA-}XEzg>WzIslsV{tAXm0&9cmQ3hozpd4W5 zp>E#w$pXX~`xJyKo1UMRk=vm6m>F}n{`mqYlHa3-Yp?v`$ksK1<$#yC)?b(kc)hKa zIuo2!w^t?gbB-?%^pxdk8MA-(je5#_=CvT4;1K8ZW$b_ZK}m11rz85asvZ$^`&>I> zMT552J>6k5ee<|Ue_WQq0@emLHdMdxUZ(Z48KN8|bhbJGQbuast4-Y3U#@g?l$}a= z`YD@;U5~?fUa7=!39G-|e!1=4)_iXkx9zl)Wmo9}MdPo5R~I6mcHY?Mo(0{!qR_=r9Cif)?-M8mvl8{lyI=ltun82$LNiJ0Q$^SqmJ5CL!fj&7)L#u zs8f&{6sPMot<|;T;}+Syn-zBVNfo{?f$+jY;}bU&p*{?Xev9 z$!}3L6A<{1^OaU?oSm#Y=;rNHO43V7(Th@m><^pVTkqYwvInEQIT6v1BEX8G-rISt 
zECT4>U(ocnHOi?tJ^R0lGA;U;XNz0?9Sl53m?AcswMwiO?`&(jwMM$?ZFLP8=oEl# zs*qmQV0=_m?)vL5rP-js-_ELsRqS-ej~Ur($Ao3TCH*t|Neai=8Mue*5avB;)5x=; z>Z8rutK6z6ay_7=q5y04xkxk4^1%|Dj^oylMw*GNR6dW6bp2+~{=r3_9zJj!Ra*t~n1bG*ak%`?a;})*#EK)mPFX20Qz;sTOwDt}`+o zW6@CWM2w7*8EaMX+mOY2A7A&qvpgkab*sPXnf>>Htbns7 zdH8l3c3KMjb6sBbLRjqI^V?%r3Phjlj+G9Bp~?Q{nnCJ=_-KukS6@M)BV@I8ZIYdl zaci*D-026j{yfkYTD8My7UQYAVZmh&ckP@<=*$w#FKw>qOzz$W!B`nz3Jb?mw>11n zKA)hz2fI8S_~JCi1Ha%u$whi4INU1HGvshHZRggQD8N?V=JFz7WJtCHkDf1ks8 z!}Lw%&LquP)Wlas@GigLxjQ467?`cryjnd=x-|6u`I4A+w-hB$M|Q$f8Y69K(Yy19 zug&qF6F3Gum^Js0 znwGa3de4v!oC`XvM!#y-I(=|@bInWO_-KUI=dk4Qd#$hj8aekajjToLk=};E1Se9%fV9r(iKwHMrCturSpa ztDgM=WOV&QTseTBnZP1|fk?qVagy)bw8TbUK>vGppUq_rZY(wF@@ckA^O|DfzS1*_ znyyoG&T$`RUrKP`Sn9<(r#?56zLdC6t#rcs_fo$7W|xJYTTvew?>D|483J#obQaB; z!F*DoQC8mFhh+AxEUqjoQKg(q&-PdPll1SndbUM}GLQM04|yJ$%g?qkcJYpqzeCn> zLw5)>FF1iMC88R3lI;^+T1ir#tPJ%`o zJyH^8AnH2nf-Pb=fhq$PfV|oRyBxRq_x(>7dkJvOW7_@9xh%EER245^AEgN&Tg^>8 z@kI+S-f-Q`ul_~8(EztsM@Aw)`^xu*fH!y8jI@ioBHobU_q2RvFZpBD`d4i6PA2$C z)IB{~RW~4IIYyJ8`T522CP8m+wBbhvM%P?rSG`PaX=;M^{VN<2g@xy_QE?v{Kqvfy zabwlqBXmIp4b@cj859gZ<$@^aNNq;~uW6v|Ok@q}#{=uPv>8HYW&}q{P2M=lDU@+t zy;}AY3`RgT=gpl0i*ytBL$rOmWQ0gK(-u|*dn%In{a!Epi<$W+@Oc;0B=vScsE)$g zJ@R?EHI+c0U#zVmGS|*Dqe_@LVy3G*kK-$ItVWt?+k=(noo!gtEi$8x)JBu|n!=`U zt)Csays;VRfW&^AG*xV-rOcq-L*xCkO)R$tuU@)fEk2`KDwnU(at|7OCuC2W1QNidJ=Y)U!Bp{ zFfPH%&qYAYWmM13yf5#nWj!lP1J#4avid$TFW_Z=PrA#+npCHqnQ&$4+*vFgL^4T6 zfqMka)Hx#~@g|<9^Jagh^85-4KwO%Ha^Z=>rnT3UdQfN~&f8TB4oewcA zS}LIrDtN?T-6ZEf7PDX^Qla>~3#nmp?PqKFjqK8`J8up}zkgPaLJLqTl%f}MMhW_U z*iScwyssL))32^H<>RVIahZ7EFa+hu5uI9__yBMdAOdgFhIIKz z@DMRM+6X!*sc)z>?7~Sgb!~MnB<6oLH4Lxq8FD&YDLoV2)8@F7x{52a( zNA5RrJp_t)+Ju{~KL3`$*w`m=%sJrp=}Ng*y^i!hJObj=`cfJ{;Yk_UQQ8Jd%PbQfmKW&d&s~VlrAdWM_+{^t z!YddXs6>0qqBm1lRDP+N`F(vn@x`vwfy<8-qZAwU3@WV>B9wA-f2Rm z7_QEL@68xQ((7{YLGRnOh%dNp7BqT^Yp8PUTaD}tH-&8j8 zMPrPXNfv^KP#=}U1nfzp4lJ`?#>iUFjw){*^d4KJWL?i9H;3-f3GZ&2BzhFm1<%~u z%o9}!CF~dDUmPfJTdU`(T!i_z5&P#?FMr!<=?(!tnTc$*4MR2%j~aDw8m*L>M$a?6 
z6BTCNMPQL4+0^>hT7)VW=c$kO?+3nG&rhBbb7oV|iMi%pn&!B=tRhGzcvvb~Zb?$L zWMX0hT%^3VZyf9xg=H$9((s(*9Re2X&FoPYX|<&RvsAPV{{8F`xkK)+6mFb+&I+TC z()<&f;ql$DrQ%wy2o4s5k=Xofd){5Q8fjhJXHrxfMk8hJiUXj;%)n5cj$@bF{O+y@ z6&oa+HLMt$NQHL->&UUUhT2vC+XR2y=x3j?B*Vj5^{p?{5BKfxZAwp&Gl)a^FG@uv z`uCJ!`R@J2j2CDbmsF3{wx&VL(?c$Tv#gL3AtQ$zP4+no_W~tqv%tFnc6DGOxiJbi zMb_O5CCgTz+iOn&j?}ptEMR<-zmJBojRwrssPGJ%q+kqA%*fCzgz<1&SX;R= zI$3)O7Rf?Y;j92+f_t4b@WsUl9w!ek4;|~RFj={*vMbL2qAXFNE6KdUkW^WanR)!8 z7;AQ*rmo1o?HfK(M$3XGh7V6ta=BHUN}zk zD6_yfJKTSJK3Xm|d|CzsAY0F{2Fl76Wb|UhuC_gV<3$}XR&k5wm2KU-U`cz_X zTZYTK1SrId^7Fs?`#(;@EGpn1{SxH0Jc06&6G31!#a{bN4iXDX%-i5c&D0<$l^4vkaQ6{ zp*_E&j3Bf<*pZ6$&!Q&%hWHafVQGpH<-EMECuxI@-EmYhqw$dW z{ll?=ZUr3uQ-7?z|Lp{Misn5@A&(k1U%#`3*9UsdyN``*u0MbA0yQTE_wrwD#~WmD z;7mTpYyI;LkY_235|@{X;S5Ai4SyX{W|?9+ED*N(lzL2VMbJf?j{ncs3rB!d#m1d& z|NH>?>?T#EjmG4?M>0#P%_rFY4x}POfQV$;=P*2t@lR*syhsUY85nfnF7;JloA1@? zFt}p%7hz2-6ym)U1@0S+mADavqRj?*xKjbF@iXg09}@mNG8H)zvp+Ky-TgOa|J_AK zR@FP03XUy7;|DmB7(f2cA3u4+2l2)cSpKbg`S0@m`E}ShusqaH&ND;3`Ohx>bvXX~ zy8llf9xQy)akrP`nvhGx-=EfkP_%~!iap-vo8y_Zb-*pG3kU6%EeA0{xY(2)IQS z=B(ck&#ZE|wv%p`I0vwKL3_B}Y0b$%`2y**P+B4r*CRt^`54V!&7hgZ!9&{G=Fjdvx@64m8W}&7& znV}sbZ4ZH z5OCIZy8FsiYN9rA8$FSExXqKJ+2v<7WI4d^-GO6gs*xf~f{c_izZ+en7J=#$bb{EBpip>s2O|l6!Fh6`RH4(4o;oinIP>o=WMeaZ6!I2J?K5MOBO^N#1*%#_$~jic z$Boy@9{hSZ@@vRUX?!xx-=w@l%YxWsO-=J4n2CkY*%k71!cdY>YC%CRqifB{_6%;t zzPq?;a*20Z^ktgy7anJw_P;p=yYEdKoUYbkJtD@MXK&!zwzqF9TR(L%YW6CJl_u4b zTN+m&55fH`ZrqIVV*j{DRdD7MVB?rrzEMXF>jlvc$DniU;P_{qj=4Pr>D^bl!eV){ z0o|nLQ2EvYnto zF8KdECgcb%m0dF0WOT87=!>d>=LNFp1d+(hH~k!&LXk)2_1+ z7q{qxJ^Iu&=H7Dnj6eOJQcz#Em8YZg)t~5;(|NMlPM#ctiGNzj2S_lUcvpl;<-%k5 zp4h+p;k!BAV(-=0E1_FgvHjm72KGK8+8GMO_CqEoNqT0X5qgK)I)=pn-IA@IWu#MX zUDEvg?1*_I1=(wpYIe{Les8_|Q+zS~^xH>IFM`f;!dwHUfVNn;7r$e+aYpik=bF{K z*&!yZsk_PAWi)*@*R~Rk14}#c+^LJ0u_bPK8@@Kbo*`e-ca zMjlN@;o4?xwt78i5WhVp!p#zGsnZb77>VTh^W87iIztIBjNrp8VQXHK_dhSUmq;ew zPg_nU(lLJ7DrX!EN;mG$`ZBu(BN-?(q$Hsi+YgcRvL$V363g9h_`%ETwfPc{ZP^xx 
zn^rrmmA!)*!99wnhO46_7p_?(9{+qkWMnZ}qvVa=drapp`B{j+V$l8*QzCB~5RJA$sA-9Z&Ic8>bPvsLL+#u49i6*2|y|HFDdvOcMNiA)$|Ole-J` z4sbbKteBCh+~096i5X`{psO?zE}1GDm;`)@nCbTCLdt0h8BX{6O84?KuTES-^3S=)*!Od_#rVg6o(rP4 zC>6N05D*?SeJ|j8A+BvEntO1XtFjlu`)krpajIf$DG43Nel3k64VD~mCv~;W7dxHf z+9R*AnVI5WyS}_uUtX$T%bf0Pn{$=YVEKu|R3f+xJ{i!rycQ(TK|lV?(y!s>&)H#g zyLKrgQ+R5-6;PVE#c@<)q%PQa-ljs5xu1suB1r;kQ0S2B_o~jEhB`{=BKGpI2@{CY zWyd>&Ub|%~r^Fq1*mpnDX|qow!`>`sq|)ARcE^21TVQiGHg8FqVQBE^P?l4}n`i&E zd8!Ms6w@RdQ1W-Kn?A?**4{8C$}54m5!HXy{0CcR*8)ekLiC{sl_yp;w@s& z;*(b^$MY$oq7huKWoehLk>asi?+0FUYE!?j<hW!@sQe0$@V3tNfF36%5N%Yfk2iF3Zgf+F* z>;oeQ6-q4V12j8z96R_l4s%O#8~El1(}DhX4V1*IomWfe!BuhGx?D>QN8rfR^Pnhg zbBP3xT^_aO`QvmfuWQ-Ya;pWEJC^DN6oQ3~l(q@c;GJiHBpE+vji!SUrS$H`&yyH& zUd+$yK3kshXHAF+oC6M2D#p;Y<2E%a9vKHT#dXKnW#~8mq%PbjsJH4HGSm6W?lDXMfaidjs{veA~lZn^JNy!Z4Qi-rO8?!<6n=n&#S;%jD|X3RGnqHGD@g1R zgEGI#@6TB_N;>1Bi8-vIWjASC7@%hO<+8er`BF!tXfv~7GoTo~Y`J~$EwgV-eem7( zQlmPhiVIi|WDFbqiOTj9J#%GXX0vSA8pl_*x>}vr0X)VrGMgJms~Wk55jZ&cVP;fu z=5@Hc*6pORKwa5dA25jZh)i;l+n9CQ0z=Trt7e*M{zPx;ZN{|f>r6b3YYBT|+TIx} z2arWs39eq$LE??812cw0L{cs`9IK3kV5D-3=(EG(E~pPCCL^8xPZRInC9Hlah4#ngBl`Teyzn9T&Z$dDWzC;<5!kaN%JnE$A!ebZkDO$!o1xwQ|e#H@vteeVlx!W>j6+-1(sKi`Bm9F zxD%z#-K;g=EWR}AjCI7n5mvZ5_O8+*!erD_7@*hdmm-7I93O19bA@-7a(yH{G47BdHq~V00eGBY3UzprMR);hR|=q$I1?f+}&H!m;#t;U8_oa zDbRb%&j1U?0mm{Z4Jb&C9Ri(|J)c-^N3oPx*B>ofq~p&-A?ey3SEq~rsj;Z=sIV4) z@fnH~r={iDU3YqvXlNEUdVN}w%acbNk{A3gtlu$)SFm4P_s*r0Wj>6tWUi7Y6EIxX z%u&jV8EJ@g*{~Y>{t;#dSzIog>I30Pv?1fw<$aHO=$XHkUf7r$=~x-T?~(b1_1eo7 z{BXbH4gIaQjbu$wdqR87B{>Fo)IG33J#_Z+%BTKM`QNOd8NpPyAV(MbfWg?~ElN}I%UIOcZQFJ+$lAOrt&2Mc%3tJoh&@MV2O zt`ZE^8v9zu2!ie9VVOyRzBSR!`(@d{__>&%9>a5~sihbw*>9;!a8M^sP@RyL5LNs5 zJP*=O)WaGTvQ!Io9l%j*l*p3AnWvq(!^t2fhud!dKpi7UKaithivnQX*|UmF98uc( zHk|MVE))SN|Iv!N-ZBafu7 zg1w_u%KZVcoxyAKm#D8M7>|l~<08O+S*vcam-BfuWJkP`WfA6yk>FfRA)4c?1Su=* z%ihZqp5>*hV|fa59q$bU@+5IfhSAoL<`9pS<@L=E_oY1}_+N)mNe1;xcK^I73mU=5 z2m2Ci557lnra8j)LsQsWWcK$$*^_SC9DEXzDxcwcF(d;lv+km4)vg&Ng^r+o#{WFq 
zP>TGr%i63-@)U_BYzjbdcbr#m%#@h$!{W}PbXQ8N*UI2ni}k$UDhN(@OotXD_YE`B z&2E2>{t&0LP(#Jv*vAkA!Zybu=>4mPx)NXAA;QJb)EdxV@98NkBXiE2p7|p5l$yv57JUZ=uo6F~p5E`zw|4p_Y6`Zij^^&?NH1N! zW|BJY9Zj6Y&{~Iib;Wqa5Vl2!gzCt`$n)=#jix|=jfa#*JB0mDd3YB^tH~H(H}wLi zk}0^5*zvB-g4=r@}K7E0<`j*s!njRjE0 zLu{#vUql}SpU4HR{C4SZ)m30WR$?iU5!!6~UF{sdKUB~^M6oeLT{q*X*mS~}`hRq8 zPoV6urw5V@4HUSJ!xj9@92TaD{?>xP23@Gq?;NIQFuO>X{h_`2<%%RTCKf6Bc}F}N>&x)lATitySE?wJtJhGmSj3A2U-c32MUBDLDKVY-n?qp74H%F(&yp*aH|6K?3fYwOz!>H!yxVmISnDM5wM} zU@9=53^Cu4Ji1yUj=2;6`Oe-a00oPSC-n+*q5o}q{Qv#6y_;xn0Yn?|j``Zq5ZQ7D z+$5Z3<>X4iTBoPVVL|14oMdg?hm)26d_!+{s2!$GBQA0O`H>rTE~x8IPEF;RbtWJt zIVwLm-S-^GI3ESayKfDFi>)FFubFZI+hY;d|Lg*Lz`(y>8-D69{vF}*f$`4}*MP1u z_~+bz^6JK~J7BpG96%{N4y`6*n1tT^d6cYg@ZC%IKiyCx(gKW!0YZWAhbRLNa~vF; zKRz@Qb?$FCBtq6If3+~J)NYTL02g%_xG37+5C8j{A)W945)iy`(HopD8~1)M7(+v{ z7~BL&%wE$>5JKl1gj|vw9Oq++a&n-lHIpr{?4VoXKcZ5n@IJoZxeDJ;%?uKjxlY znLccZOq2e6Q*Y3OB<78Liy_a5_W+gP_Dh34Os$AjEl)f6En5SV zaa$;(ZE251r)&@yP&SBDBhVyuiXTcBM%6a%R0H$~R4cUA)YP&%S|1`|VN23$n_i#N z8-yc?rDEPY7Te8v(JPDskolsoL_BN_f_-`6Vz+W6dfaw5Wr>!EZbx~X;} z*jQUThD>!K>b{<1MD-Eyr|lS%v+HNc5dLPYJup>Qe}Ku!5IEjUzQ_W)*&yKeWgKdc zjb;iFN6C>O3BJQ0d+-Kog6S?o^dFO54#VShTGkvuJ=ot`aYAUDz!J6wbyJNZBT1f` zffY63#3v0`5l5nmS`9GJ5YhPaR*D$aNI3U7sRc9|z9MaY3tq7L&lczez+mP%ACYks z((|AB9z66fr62Mvf&zqa%HVV7XlJUUi;i91GdjzwJ`Vew-@Uyl^MjrBe((d?-T;#B z3c`m%H`B&QhpTF*N|c=fPscH5ID{kJX|j6v+YlP1D{VH=LE8e@@C(3GYgM}s2O&jS zk5%=VwnabKT!QIqn;g%px>l~amnziNj!Ux=d+Kg(HB%{^`Kv z&M*vI(y=LUb1AVmUAV;0HmyGo0%iN^$*_7kqn$F9YIiu&LO>6vQGqIqiaz7H5zuFy zMy$&#xugnJ*FDoLWTHdG6?`H{>|D&>Qj5P@R=zLS!J6hP^IG-P$<7P+dzT=>Tg#Nh zahL_7ZN`S>m-;-bC;geMeoZ&$N5+tnOI_~enw)=$x{YIkFdJ!R)*C6;NUxH+{rni% zjDL^(0q8-ORqL__f!#LWwl)MI@+notJ80c{)*&$6Q{u$}&$cuO)2^%Hr_*}Ra{Ttl z+0H4J*-*ndaVVU2NpE_O?i12bf0{^{^DpaeCQ<}wwuCOz7sH~t?PujwuiZ+!DnlEl z(~hes*F(v^_gVh*-e&X*YCqfT`}EYgwi9(uh!%BT`S_wugsX`0&D!-C%lziN*e!&K zUs*COrdpN0_Pqn?e05pow05Okn*07r`Krsr zJaB5=LEAsS>>asf_7HuStnmty+VvISdq2jCEw}NrgNqh%)Du{j)nl*iZB5oMsEi0? 
z{GqgvwV0h1TtOcqzaPbcfMtG`OK_yUYC$uwX7e%d#C0dTeU>F6qI~iz2>4OFPDWH8 zRX!!_dJ?Q)Be}&9iq;}aC`Cx>3XhGh(#hsLrh$C&&k;=bJ67Uy_aYN?_@Z%te&MX$ zZsC#xYWA1XpF+YK|BhH%SRla%bLFhpd!MI2oVwH&^|BzEo*~wM1~@QZSDz8hMCHg) zM}0ZkUt2r+t|Ik1^qT|ldq-BvClUneMjt^6Q5mbV5JT{0`cpThY_=+5<2GAO%r`du zv}ZG$DAhqZAkXMXCP-~BNag?O?#%z8Uf(~ip=Iin&LHQ=nz750RAWo1n6YIoTcpJj zGNo)Kq7Jf;F!p7J8B8156DeCLW+ZVi7&G_0@B6;4 z*L7XbR~p}5r}xk8O5AN@7?xz3OJn0|#;SJ&XDe=qf2h{DMaf8)xeLa{4r0YNH|@@y zVvFa_UQbZ zJ(eB%%8zG`CfhirX?hJNoh#?oM<2D*WtvR0g&Go_M>qv=m86?Fip<3KcdOTkZIF5T^_mhRlboceUF;on5z|(DLAbaz#)B@ zxdk17;qa>NaVfXUtj9_>y|J`V4hveDa`cAGm4gGT(8$VRNG@KEWMkFR+3CA^1jC+;5UYWTm$oF#_yL@Vh5lxk( zMl0!$93@sa`iR#ja>$kOuT6ek^N4cxN}ZS;OF(^MqYr!MSC|I}p8)r8S;#Ce ztsiXCddp|7Wikmy{awp9`}}6oup5(j$)>KamJ6!C@8?lG*5 zsw#0|1ZfHF$-5s`xBseS*1Rb${_59iNvB^Xeb6_356)-_CFVWP?c=JlBKt&$NhJ4K z&j#L}c(vrUL5yd9bA>zOE$sEZb|OD$E)GVT0q66$dLY(wUn)8|5=*q!?oebj=s0Y`Jrh9WzaJP%WL4||+?e^%U>ffSXVAU@}^A=Aq3 zaf>kO!EYM^HE|-s4*IlVA^$Id7klZ8w>U)5YAy?zgA#VbJ`{rVR$Bp|*|h9{5_(16 zUgFp85Z`LjYsNhIL@w6jJj6N|a%qHU)6J!$lJsrayku+JfGwgEP6l*#TCwEJ+J)xb zC8eCWy^q>oR7D@p{P`u){dpY))g62A$K6w8r2(UuhLzek6hihP8U%&}t$yG-LdHPi798 z$7<4YF;o5Aoqh$Dgoi1WE84!vP?Wf;hKMoWGcr`>q4#iIUkTVNZENl|0H?6<0!JlP zXnLsaN~?v}+u7Qa$77!4@I}$%BrO^IcB3EjIr*@y3!Ta&RYDdM#v@bM^-)obgacGg zcGBbV5Pm0O9PM(JMgffAs)S++`q|_p`=?^zC7k&zBkd#NVbI(}PWf@?cu2~n(LZF* zF@mx3qz&>ViNRWJp82H9WpDdeFi~e)$he^-j(ws2c)czw9^%81r|UzBn+uOd+P4f{ zeVxh4jum&gIyzBMK99|?Ut``|vG=W|li_OtV|J5w>!BEri*HsZ3$rPE`6+nS= zP&MH#>8Yul5)a*y{nt=wdNIW}cr(SGa6M)xusgRAPvU8$aD!d3J;&_unI=7Azv-#@ z!SP8n6_1yusgQ`XbX*uOUVa5kE^ShESpm$~htX(5;*G7p@2M**tc;GPaIeDVQjO`e zPA6CBu9${V+DCba6-_6^kLk_Rwk2uJ(*$$sR9N*hS}*EaOVHsQMktJ(VPFj0C z>!z{&PO{^x?VChV#5-oLkv|Xiavum&rPJze8OCGGypSY_y--_pSUuZ z#}7Ma+DWpEm4&@hRdy~Pl$-B_5I%LMfk>uausNL z*^O_A(>BbU7oZ%L-pb;7O*%qTiQf$4+NF1+OLJCfofI8ac`SF+pVUD`GQqE1rQaq_ zIM3K6fAr2GH3DxF*KDvne`c`y9Bv3(yQuW1VApK--GD-s)OrY!iOp;aQTb#K>8--A z0OcL54)#w;2`}}U{EyjPt*P3fxVzN|KR5h#r&p9WZV2 zDkIiAUvbSanVPeb;TU?%HxG_WjR5Lu2HxG6MTuQiy`-0CU!{*96J^Sm+ioxr6fZ1! 
z$S^CNoGH0#l9g1a6k>9S6|Je2;!@}n;W(2r^hd{W7dBBvgGj}9WM@nVbL5Sq$Sb_u z0hYnE+qL)A=U9<(Pg}S~XDWkL(ZTps9&L&by}8IkmxF~HVc%jTY1U1x(?;J|+O9oW z@whUqHrbUB1u^)@e=*3`Z3!z6)e+xD&zvm@CdX8R^UC6|%xv!A2z zlQXrho3d(GiB}to#pdsD+y@4mN4n}A1vv-&$GCPub5n!a?dFq&U8#KR3she)sV!t| zOXd@~1c{37-T`{7Dq$oHFJNAC4cExreyFzR^?gTMI;B%2h9DI|52r3Ul^X_;=G47L z%`6YJEtcswp1$Ue0856%J@sc_&-95o5R{o%g|N8M_(dum+C%$7?diSZH72{# zOZJ5Z5ZMQLh@mjjZ)SYIy<6;yo*J`zKsIi>ZalT3n;Ey`SY?- z6?uF&M{V5o(LkEgJ$?eFqB8o4_^(HcpPFgw@49_0K{DI-B&G|Z#oj8+oBr1JNj$r` z8&Z@RU#C+WWx_o>Y2)?A%;+;Jw55;K=qnS8M7ErEl|@?+Bc-;Qek9wRp(_pY40?1I(YPA0DT`DGU#+C$ zVIf(-u5y=UrIq%Zt<&_re<>Bv;L*u3+W>VUvJJ7)zd%?XIzdl z$$~2gT1xZca@s`(ET%Zo`|ut!VMAl@^6SJXXO?e;4G}KHJ~gPl-D5`t(#x@rNo%Qo z5$j17zKE}-gKisbxSC@Z<(jWoOPH3Xp+cxGxN8}HBPd-b21%JM8`P3a_KDq*GW&AU zHCCHot0Kmbs(g-eu@ROaOQ`ATGyRxAc}!h4h}<)7ZhN5Wr8-7v;HIf4TZqnK+>4@f z^8F?q${1lt@+Sp5tys^F^Q#>?mPOXt-I80J=Mxw#9|GHkETDS1NBcVSagkgZ3;S4a zg=A{nd$fw%yllfW!q_Y)ZXTPZm%~N)_cdwYGLCV->LaBmQfa2uO1n(H zX0cTP*NJ`Ytz2F!5M_dmQW=$$(Dg}`2k)*gCfe+1!;s#58xM@oRoj4?!EC-u#6+no zb1zBr?Dp!h+|G<`Y4{p*(G8)Q#0Mj?KJwe<&^+$`bw#zFAEp11cE2YIT7O*cu_xvf z9>0Gos{6D^kwKIkAWwCIscKhX-1Wb~`|U6O%~V4ZokDDqdgQ{Awl2R!V=-)1GaQwE ztW z%X2};PWr)$f0UQM*Fyn&@9Oep#C#t5Ia*j-REJV7_Q+ix?>>W|^HO~VljfMFMj;{RLz6&kQzr*CZ7Dt(fz_EHISF8E%M-TC9)ioA4_d>{yVZ=oXai7k&U0Q9b9 zpTA}16TtNQ3~WEy?Z%*e63|s30)H;*FqHY?E4N8@9c~jV#w-RZQ6)V@g1v-&Q3g`s z+HCQkM*HvO3uxqv(rimpA4g75u1ZitIv<=N%wGgSso1&WIams~ms2x-1C%@h<^u56 zZaaeMPhnYA0|&-erqyHLgK;8nO>=|G@0{`$67>3Jd`# zv{-^otm+i#t>-~Q^s1ihDeEu<5B21|&ancQUpuUCWr(jwq%g!4P36kOdHB%H(`oR3 zD(+`zWo_n<+p^gR4oi)Dpaoa#gj~-Fjv3jN7Ml}iVP7`=qR#d6GZ=P6SXK2cpbNB;yH~d$vW&h&>|7i4o?{eXUCr~Q+;F8h{GyM$5A#gc= z_Nm^LjcxM>q!+*eq0T9vf31)~LNK`yAV$gDp4@qOckEY0ii0m$4Y96Ig{|DWNL40q zlSC%~J$vQp*cIGEJMRzc6X?EmzaA|mdw_M9jpp1e81}vyew96r_{&4lq4ZDr#|x(r z{TEt`YFh3?2$euA0~_%-Bq`IWlq+B_jkse&d^JW)PpFv8yd_8+GKQjpWxzzF>O@1w zvl5@_v+#|->*2@Kc($~|*mJz{mQ;du6+Bdj!;P*c^*pYI`*L9Dl}?+B$;BzLTnVe4uY&LREPcz*}N8Uz8m0Ll~- zBQWTgtnT+HpSLUTnCN_>j!I}0m$<$Hg$~be%c)bRz(;$=&&DQY?54CQVM6t;3|*(B 
zyV!M7aU875ZTSWRff9jlGXp&Ht=1Elhcjy9|uTmRLt>m0S54uzVM9k>;4 z{8N@UM%SbA&m%}tEBr;m&Ze6XA-=hA&f=%TSu+#eAVQsKg*^Q2izKD-M90Yc(5Cgq z6(HUS7?K60iI!QYDKbM4yR6n_0?O9>G_^Z)6E~6qY!vi(Al2)~8 z{ULAF_t)7A;AnK;wz}2?WBt&T{ux}^i(x5nnZ{$W$Ho4{{|V<$S&>EA+e=j@hGO1V zLF_ZUWGa9QJqMPFr`9&3&k!S|Aiy3Jq3sgp<;A31td)$>@Z zdMd5%NZHoVFUUNH;830P{}z`s__=x?yfe`t9r+}eQ36xn|PzwZ?ejLoOs*E(PtV;g}oDh zZWQ!>B3JQwPyO)%A-~Ux6<=z%ToT-14f~R=S!7jemW*Hj5#WI%>qL7 zvYd?aFAIQ4PEmMxZPj;(82gZBl&%rN*Y-7Qqmchg%AuwY<|XdEJ-N#DNI(sYGsea6 zMhEeEa&UH*PQFE2Gz_$ECxObPTCOQ)MUDrf%lY?dD2h6gPg5_~-{hHHS3`KO{ zHy`9o`WSVL^igV!xP|U;qus%)uuen1NU)`_-pkNVw=OO*H_tr(4i`Fnj7_%5=Ym@! z^aRAGh=~4h>HPg`kQSgrFSe<7K~0V{xj@fGd_g1{xFrz(thmxoGAJKMh`MS+U5GWb z*4I$5oYbkRbV(XeuGTp~(P{D=9K&~j$pMw@0mT%e7r1C4>OS_GUM1eZz^g=8my{^Z zOYU((V4wN9RPuSNiqSm1;AzggnK#-`#9PsU0|#Zxw-Lu^CSnPA8D27fAaR5A4mq3Q zcKqG4LB0woXwQ|oPp;|3iQ{k%*`#^zNL)Tp#6t>6SHU!^Xm3`)spPREE7fUbV)B@cnM)Xkv>OeOkT@&<-M)ae6&v0Z!Gr^}4Jcy!$S+KLP1?P*i~22_u($P| znDH8(G+awYww>)buB@YIpQo`jt=+_*E=4=`GnM$UHmtG4c5vC;9~be&y0=ekvPv zB889e={yu|S*KiIX4D9lgodX0+p(`hpD#Z;t((Kt=s#;(PB%GdKKmesqZ%BWH&jUs9jb;;CwcdW zge^1|Q;G~*D8A0KqpXXyvF(^?`dy12Wv6Qtho&S-y7Dc4SJ2y=uY7%$D)Vxlb|bfL z5BdV;Yrj@U_p`7+^pTDyAI9gOT~?9MQFs$XSV3&@BL91D{MClD3+6_xPN&IK{4mY* zLr?-#(z+sH%`>_`-Ns0i23R`E|7sSKMe1ae-BXpEI;09BVzi8O ztWc!$%OmShtON`M_C{7vJUmeJGNyLsE*1pr9IOQYyP)XBENxs&oe1c~Yz$pYMNEzD zO-!Nq_@JC!oJ7=0*Eim*JRc6VDuSq~R7s;wO81o(JFTLm)>I^P zY1T&6A4lPqTEAO;;w|K2S0QJjLKB(~kQ~_yT7)|iTFBVrkB+&P6J-s{D@0ON$}I)r zQNJh9*HR?w7XoADzbpconaqq8MT{k1Ny&+$(D?bKbP>7bJ{rOMu})%f)qC5L@9ypg zf3`RLT;`4{h^ppq!1#fFA@}cEODrVVTCHvGN*RyrGgSDc3?rf)QL~vG@gS2^R$#>% z{U}6buj>JAE#qt{B|qO=@3s+MiF@4(bozDA%^^ZS=KOf`0&kOl31vnk(p&8W98a{+ zky$*q4rkn(GyT?fV^KQr5&xCWhbhy6Jo<5x3s4=F4Z~%3@#&I_JJ=1NNbTb^9GN^F z35Rr6TWq22F)Jix`Nq>=q$X5@)>R||X`gr6kU(z8p@&sBx+TBZzFHW|z(RgO&Jni2 zc%*;9-@?%%oBpfm-rU?F1d{t(USPGwdeT12b8lEMA6Y9b3_Bw!Bsz8AiR6d!(Im*g<*q9xdh?Aa z3(tOr-Q^n&%*Ji?J94sg)_gK+Z!FR;I2^4lc(T0`jt}?V8(nZ6ia66Q%}6ZkPcEFMsmgTHB$Ni|L}dBcTWkvWUPj`p 
zDEMRKl?)|(mrgg(B|B7zpR4!`hOdyhnp!uAf?$2y^BWJ1K@9cFRmEPx`{Q1Vmw-ky zBYU+po1dYd%X3SZLVh!Z|JmH)fq!bf-+?0|9wS;eG~k<+#ffWG_XAu-+E#`#wKMrYZ2YJGKdzXW{x8572^iTq*#8GNCITh~7Eb2> z8JP(f{znopF|x67{{JNYzd>wnGgZ;v*<_8ec6`ep+4-+{IY8VgZfpP3lyZ&~bR#*3 z3KX^YJ#)looJT)ju`wp*dR%WhYg!uxB&ur02#nwvK%|0jwJ)7=1`psEQyk-*T@ z!4%`OIQ^63)6;PRl9Z<~ERHQ;oDo^=K%Rg)wQ2yQ%xm!?<^a^p%@0#N4wF24?$ zz%o4mPy*`6uJqyr&;gsr`1vqSj;0K(jvo%dxHL7m0WA2GMg-#Y=GM~GYkjQi7#P25 zpINk%5DBb|O`x9b>sr7vIMaZQF;Fo8q6uT+3s?fMP5|jZH8R&U0RL^|QoznHASR8T82{HCZrJw1nT`lY8cvbw}ldIoSr6m^9JP)NojC@CoYeN)CgzaQ93!pkrJ zJ^0%?Bfhuq3<~Lt=!|OwfnM742B-(v4xFW1_22phXRm}e3-~K}O|&|IY5Xe#4A`7p zSxNjCIWaU8ys$Vkl8Sh2GjVS52G8$GZH*oPKE8o=@8}A)3HXONIy3MsET>Mr-v{-r zixkk^f;m3``pzzi^h{s&6P|)EiSM@dNBBw@ak>pUWZv#rzpC!6)>y=6uKj&6;v{?A_(1c;{kCcuBA>sy!EI^5p|iYSUH3CZdb z<`F;mtUn*%Q(-J{VP)wh@4W9tWcb7raCdql@4>Ret+Z zm|Q)f34O_5rlkgWd}!c|zXoj;?i9^mmEGOhWYMAC%?Wri;!piL9P+^5HcdbsfEnN% z1i&j(lkvy)Wv=*}wdtGHUt<642*L?OBYnLC;HR2403RI$A0649pnqip_WbPWF77WH zL8Beubk(dDU--8{5#hIj?#KWT=>B`^d*HZF_pjx9`OiivGalCp&Z+(hC>@v{a0MkNVtSEMr}J%kyWi{SRUR=(j}b=TD;5EwT9(B$HaZ zD^v4Fdc}XkR7aLCOmBE#5$?T#=r5h+_m;RTKJbUbrTQ;d695d2F7EH%ce*g626%o1 zw146w9{Rmt!2kXS;g?<;Ah|jyD6X1i$3JF@Ul0k-jKJ9%TpEBHoV)n=opm{0}}%#BVS5SW?;-aj}n@-mPfJ9YPWocVKg{b}CkLv!ta z)CUd@0PzfxNmxx2IQ=eUv#32p^-UUOMpZ!Ri7{8%0NP@TT%g8e-U;C_^5i}dAT$zq z1E{GoLf~ns2mH4z5Uyl9i9plUHe*l8y2O4;HRw(D3-t>!jzom)nHhs6s0lR%n96Iy zGS=uB%4K@(tce?mI}8vTiOiT=yE~?_TH)S&#lrM68TSHY5uXH_#wKIV?J{So@@*G% zl40kt_9j04j*)yvMBV%}8fv(X+{h63(^JkR;y;c&A*p8NR|fz2WA}EGK?G5Z^j0^* zbazK@-f2V1GFWM;QaT@`n!YKgI;X99yXf{c_}09NRGLL@Up1~8-Jj}^PCCVZ9vE5t z23Nea0Zx1bueTmo7yzjhwXce+=h~;Uprup8zhW;y3_-`crn6g`=0M_M$*tzUmKToy z8pQDVBjzX?*%9i5=>}FVUHbTgYjikD>(-<`LvL&Ei~*mKfcz!KP(fK>Ol%u&`qoa7 zTwtT9b|uUf;@(QWGbPAuHZH(}C?o&iM2&xl;%sN6huqI7nT>NJ$U4BeFxAU>+8|Km zha%~j*6$X$!T8<2T!^vhKIlAQE!e#`3lOC`gNVOkfx5F@*AP-f0Y0GcbrVYN3;C7v z%o=;BYfUe~0?!;6zs3KQRd#3fK#4z%PpQ{LCWOO19?qDS)S4!^SEHK_a3-xv0tahm zOkzcW8`}Qu?hONxCda{q|AFOBOP?h1(p{X`i|Ow|(ZNq$Ml|Nf07V=q49+hocFEfU 
zdn^n@gyxe+q0Euuaf%-8=FvNSLyVbOeld@>ll{-!B{kTBBlkx=7Im}smd(&s4MCcrdUhxPNS>tSLh~dL!?Mv~5g|0Jxcpu`YF6fsjCfTz;)&%( z(Q9+=nP)bXM^x+lBfl!COL2?1jQ4S0zsXWf=(($+Xk4fQNq>V1^H|#O`J(ycxa@Fk zB+Z~=hyyJ$Jp_gAJ<;DOw@m_PKBnk|%$u6Zv$@}S47KMJwZF6t#g6A_95LC3&x}WBgKC2 z)~>*?rT#VGK?U*^kjZAbL$ql^W&~I6nR&rt%|i2L+Is;LM;=&ZiANrezvAipEo*QQ zRss;4L|bk&>ln$;4~XuSCsl69O|3TZ5W>zUyC-vpKd0qd!F73ghKJ;qVf`)w?a*n8FkKlEXK~j?>R~<VG4T&EHaM|%8~Z&WDfB$l`|!Fg55Mc?Xt)A>E@-mQ*}Pvz zEo2Mc)_aT>@_bnbEHMXK4L)dFAtd&4vHTDy>VUkc^>1BEbrfU97N4_bCPOS|U}HJ< zCk@{ieV4MX?pD9f@(IR`(QK;tI{N6##X>YN2*>$5H=8=#9;`W`rqPZ3e)d)JrnF*C zN=yW*6K{O#@~gC*6N6i@_1k-UHs($}J8rB(0HiXX;>p3!<>{$xrV&(9P6e!@GQ?hH z#Ked)NH+Y}4Y=Vf2j+$sFL$CfPM$jkwky)-)-4nlSRU&Ww`vX9-8OWcenM&f@|jSh zP?yA7#o9YjxoM(H(L;Fy_msX&DVUl#k%MRp^mvZp53V)kTq9aIyYEQxKsahA*k<}UuIP4*1G_@7fR z9a`@~d+_&JB>A(?a$3;Lr6JG8D4XXJC+UaJ`wtM4H*NZ#vLY{LohCSyk#SS#{cfm+%fENnyh7DY(^da>ivNHw%H^G4n8O2J1p77Tb9$Ck1$rfLEB9qJHANj zQ2He)vTwJUVgwRQ$e(C?8h)q-Hl)g97wS9DIti|c;s6@e62dAb700D7uE(zJLzpRy z7?zFAZRe`~xMMk1o}zh0kWjP>8$YnA3bkxo7!M&H)&az_m8j&_C4{>3on5F`Dw^W) zwyX&zbFvp*(SMl*6bthSm;{l<@6D4|9gfceS8GC@TA7o|f`RVJnggJf$@%uSNya^X zcE=)&lxJ6OR*oysC7m!vU^8|oIkYfh-CCO+_kieY`oH&b<$rWFkqMqLo-}?$7^cN+ zS`&@f&=IL!*j!{Pj;ncA3}_Z~l)H>C6*AhZ-Fl8f$9~X{k_OyKw+SA4TCn#b-#}aW z>KJ(I{=$@DBexJ2BpH;-3+4coR@C@tdo1npHO27jX?;6QN;&!eX4v|^7g>8BYrwEC zBKp`Yp=2AkUu4dL0vWv(7de{z<0_~w4HFgE0)Yy3t4M-p)gQ>W1sRZInIU24Y1P{t zzII3x0JCx-q(jempmKU~-%p@L`J7C8=3)--;|*c0Mm!HD)|Dzcmcf5fL>KM9F4H8a z`0)pAj){z|D0a2ew)<}xt-f7Ohn_r79OqXd+!MeQ-bi?5jw627fT)GcRWJ#n^OrKa zj>&7Sz(H^%!1KVPV!hs@NIx4v-O(6(Jw8cS!~*83X88q%<_gDCM`w52zM_;9i2&aG zCu&bwY-pUSBP^@r&n$~u*GU}wjb@xkGd_^EZxvf+`0E;w-X)G7(tAVachw*|MepZx zWu~5w?2iS@YYxUX-F8JF3cgLiRr5|%8X|j5wR?r9GNjQdU#yD#r3=K59xE+o^Mvu# z*_(W;6-(*`90S*&qTYLPsfrRi_L_!fad-Avl0gY2zU!%2b9(19;EB%$k#F6%dHmH% z_!NaY-eeK+sK}D*X}dw$S}mWhy#98yV5@!{QD)mCy@Z?bS$I#UmR&*nTBje%r|3FZK_E{p{%;% zf8+PEjA>>P^>hc8!!XgRzEI5r6=r&M)EgIK!F&6NbMH{ovOwaT{ECci*K~f=Ci;4EZtf;#^I`A^YA!r;?B?gm)E)I0Fvh 
zyEc<+2w)|P45rK$ov#u^%$d6&Dq4D%f)5P=@v=RP)%oi^`V(Srou^K`xIp9~%aMHM zTntU@Zz=Q3={nrMC^S;g6?zZaYd15UPt~T0ab}Z-NI_*?55SkjRi`5-E$PD`I#zBX zV(H7=yCM4;?>$ufM4@zyD9fP#U9a*~8c3N|Z~%c@m?~lQ&0i|x;G(dkKqq?UoSXk zV6OK7{e|*Ps^D3Zv%UQToX5A9FVa?BJF>n-zaujVX!bHSxlJklVv zh|X@d{C=6X7#WU3Yqn$ZCuKVjOwjV%3KnQ9v;k_%s5q4z6COjJqMG~(COdZJ=}s0x zB58Oo$9E>FB!`EGsc5U+$~k>sS)?Ou^1Q{t5nr8rm3WpJU$gTsNYr&{b zS8PBv0Alr&gcjaRvGDn}wR)>yHaJktkeC^Z#FlIlMuo;;>Q4U77cBPp;qopy5zdea z*gUPl-5V#NGEq}s@(V$J+0Fj~{H-f6ROB9ejgNjt5!t;|%yZC*7Pnm9CRsw%Bd zTMPTPptZLSRJzcI>hh_J$mm?|-oKorXW*Sup-VK=gJ)3+Sel7ft=tDrM`5z7hMjCd zASz*HEVNfKBA#6b3nA=Tulle$=c#oqL+6={?;^aKsi4R~NhY1s$-eyJ$EDSV@*eED zLIF*(x#zJO%l@|0t%$>JplC&DIecBTaybpK6>jM>aPG*It;Ay}A5~K9&Y&^KBTe4hYYUdBK)3A&evNI{if2L=?XRn=sW( z&TJ1-$l0s={88LJ12gCLh{@!z(G2SqAi||{{&mAK`8{O5N>bdM=eYFg3D7Cw3qnty z+wJXj?8jgpE+O%bPThg7JRbN_wgvGCDxz!%ZCkocsr%r2bQk7ue@A;@J6Y>DNhnqn zceq!_YDsCp*|bc)Icf1v+K zeF!oBp^d22CFv#NiNH|u^D|vyPUT zlYO=&>C7fM3?9RoALHkXH4;1HJ|dr+>c&pOM2-Q^Q3`x3k3#uTMCK4opI3y+Sn%mI zy^UuiIBLtGJvL{@?biy%{&x<0#>vH)2Fh!qD>|eHv})|zdc~aB6gkCS2RNsPgcr!csc>u&`yM^zfM@14%X8;+ z`L6x)sGOdDr*@p;b)ei%y_M|Z)`g&NZ>bm0UoE9UM3h_fkc476H`NW}FccNurDr}T z!o+i|Q&5%(ic*G(fjrZo;dN)A;QvMr2t+bKNH@?!o?s9#eh(v9uCUhI!kIK8y| z6Bj9=VZEC{3UZOjSfB*Drwi1o{eQ@!@JLk`4;_B z-z>|z2PG_rQcA-;jpP<|D7U03CFkQ9*a*Wv!Eza1HA3~HzI<`mvX z_9tIEhr$_|MLX|xqAe7PP zW!PzcD`oR}=C6 z2%Rr%HxA@zkErUALYYji*K1@QzLEqMM90BTB@hROFe9%q35(Pm$}{YxVl?9&XG0i3 z>yI}wz3c;NNo*o7qn+I-m%P@(XNH`9qt?AScP9^Ie!6WgcPv9{Hx43!iRqcNd(kb1 z7V%7=5tRNuq-vmup!ekLXs2J1K^2%4&k%=g6pqXHR-!P?5>@5fHfkN27-tAAWsOn| z?nWrYf~-x_F#W+?q;Dzi(stW)xB$jluaa4Mhbi=hUV zxNZN=)}O!TF$j@bFmCXc!V= zAEnU7{1$ia1F8JxJa-$=xZt{FJCzwylCj`Lk~?(ZTxo9)g>CBAb{6Z!K?h?`g+~XB zjJwP>aq=k5r?rM2+J}^LvKjV#Icq8Ac^LsgaeT7mZntj{S>t@$Gn8h%S_MVWDEnHx zigiNr8_yWW+Z}!4)n)#L5|STI8`2T=9=-iKQHi3$Nnt9Q>8|7O&qk}hr%X{#A}>Upk+cxoI*^hqX*^eIY~?T3tEuJ6FuII5NUx+A?yX?&31C zqIJi=>Vgf}&~nryl|hZG+uk~(xt=nv*hS)okUUmdaPinljV67f z5^HG^YAXkF6D&B2;T9JL1iVLs=?5v-Wz?a{Gug8GjNIM+9A5}PRA12!Vg|) 
z;9Wo~t~=AfNS5p)WmKB=2Vz3#%G(u8Wa& zd;M#PxLZoW?cqSdN(Yiqz@%8vo|T3bRc(M$3lu3MoSwuMkrf6dq0~eD?c%wl86&p{ z%4K*nUJvJFDciJ?Wh}Q{Xum;6Z@qVD+Wv{*V^ghm(86MK1P%_eYZo9Nf={$K<(+RifPRbW5-Ds*cP7cb5usJqUVK<^d(gZ{6TjY&hqzh(T^e|KLsfRmb8q7>rQ!E<;*mEOCW+hK~!@eERn6qW{E z20hBG+`#H8gF#<7yM-EdC{I%Zi^coCl=BMC#qpv&H0+yEW3d-nMaA!nEtrbPYiRbi zRe5D!tqfJgdX?8Fdw!cUQehQOBw2j;9rW;mp|{L0JAS%d3Ww8-TG8!B9MUonTI6qX z!m9QpMnpCyf}1218HGJ29pkT?eM-BHxNMp~%9jyx$bXjyaS99OLQc7+ka2KgGU_xm zdk@d)1U#^;CG$#rLBhKpSk)b$ylw|QD0c`uY%*fki#OLUNJ=+$?GILs#ukMga*Az` z3UnS{4uyCqL+CJH5M)esi*&h%)Z2|bgN@eZVzY(-s3_)1Z< zXOWfnW4Br;C7uCRHUu6rz!~ojV`s-x%Uc!hID}E-S>ZbP>7=B!XrY$_tb)Z@{2q|s zsrYB(iPaRCaF|8q@e$crk1S_TSzI*yNm;_ol2=xAt5jSRJj$;H%0IN=nrEn$W+|?f z^kUZR?RT8!>3J6k`sz~I1Cy3I@_ut$f?UHK@b|Kj=+^K%dPc&}hz(%!_fcU@D@t}A zqr(aKDL{65Cv4M>^->5cAJMTHS}qivaSL*Y_-4c;M4>L@42oYlJ%%0So)AzKl8}Z} zic8~|+k{WHY^gadLo`rjh5lm~upV26wRb>y8&}83W&C|$#Y+uV#{6a~j%4CnVmvfc z__?)JY7QtHmje*07Lw9KB@Hna2KqBX8)Ef~ z!5HyfvC07mg3-&`zpXp*RAIDK@AHFTRZ_6&ve;2OEEYf8Q@^=0+c4=}i$eU2K)AMO zBPr8%jT{E26h79YC`i_OG<-d9Y`Z>qrajp2i4a$LdqTHa`zC$-FCqRvY1 z?$6ZK4md@+L~S`KG?)zfA-**C8gOHpP4U0@8|bUYoQL!ZW0}HmoKCUxq&i~ti-1ld z&`VWtp4M!{%`DxW^0k?O^TijNpIs0!cJ217dU|#B`X(>^Gq$^z-v-TVMjfPDIQ8JU zDnR6ODu+?1DO!{wqGA4KgH8YdBN?A_0h4p;i~I7sgeP#Of^1zIHhr2? 
za^lrV>7DLWQsC}VS963CoJ3h);}Zya_?~pu)D5jZ5<9rbaqn@vV*GMn?l$!MLnx=vB8H& zhnjI&>#fF&Y4!mG+ctp0EzE4Z%&}!z#N`94%<_6Fw232ES5nSORt+dQJo4VPav+i3 zunHBaaSFL!^ee_3$&nd4g+Mra{8jcovI13+4F{%I4V1&k*r6y%h;u?F*q!gEQX~pQ zJAfQSF0yO!LjTd~UA1S=&KJ%Ao)Pg=-u~VT)RKc1BcV&31s&Sjr)jq`Xd(NFRc{wI z9dvJd*E-=-lrXRF`xy7|WNo9Y+bVs*xjtCFD2t{=wWk5kzG-Qx))hV!>YK)#P{^T+oEK$UGW#$y5?`-Lz_EZ^LyQaE9u*ZE0Ec<5*GD+T;ffhp<0G4?;r8P}P0cvfZ3jQCb_ug5Mz?!%-osy$J8r ztC*x<=UtsjiAzY}P&nt1;OjWtK1P5(&j#eA*HHF=~9j6!=|S zjiUH!3U!2bjifQ-hS{55=-ZX)=?wL~5|>9P+`rh1 zi!DBHw;`Pa?1(klcEF_~Ji4SiY`u|&6EKW5{60NS1Wvxar$0FD$!2QIpo0Az5 z1Rq&FjzQ$)=cD-sgRyjeGIE1&b5mj01nF-}u$Nc}>LblFh|9$`8<&~13Hdq48#-XI zrG@KDnwnS!78zB5qo}UeSt&EW2c2E&pnsyKN;6#``xGYmj%-Odwwh%LE%&24p+Ls@ z6=KyeNo9)ooya{+z+5-w-3ENYY)W|@2>3;}CaNE&@3akeLc#Ik zX-JJA6AoHSK3$ZAZ>23&c_xn;7nLn0xvj$)^^3n3d>)zuf^p%=!J3LfV1Lrop7NoV zC8N78_(gq&p_N>6kj;{b$Lh)er~H*73pY<&${o&yxGuY9S8HTKeQkvcb1HF}hy}sa zU#YcZgglW^jIVS3FDw+Oo2#CL!3XI%Bv`M=*{lHXsoTCv0Qbgsmu|aBPtrMv(YKe( zE9UBc2>8(mKV?fpF@gaG}Aa48j4z|>9DO{2Yl8&jm}R<4YlJq8k^gqhm7 z7X>C^HAj4In>7EdPv&zAnt;uCc;(dEMd|^v0jt1v-5Wv@*X`*2sW!v&g}Qgxn!KEH zWYoBj!NjI!g7EzgiWyzSP&gi8ZA_T9zFI4RR)3}a2+I^C9k@`|HN-h#nTA8;@J7Tw zaoqe38_ordXdTce^{tUq%tsyF`zSJ)EcfGfC)oT6aV;=`KVB`DOo^8}DM(!8<`pWFiR02BBl%;oMwRFW&22FL! zTCMJETy(A2^{SpoRV z|4jF<{hIM2H{1$79p`B<1-*mB|C)b~vRCKEO&i`7EW{+V7X~zMD=s2dg>bBC&v!{j zuXi=6IcR$k0$HzA+^0j}jD7tS_hs{dF?5FGwyC9c$c$iVzSt$97q^<66ZBjlgY~K*H^`h_4I3W!$f(aQB#zbZ z2K>!gWpvYI)3OXP*C!zmO4AklDspXI+NA0pN02Rke&@3KH^06%%F<+NC9Pi@aq@}? z->RD$ym#a@GK3r=_L~uNdNvgdtU^jN!Np}j{_-S!)Q25+CzZUYp{6C{E;~3h0D8SZ zDb0kTu`PnZct4`Al}Ua32oz5Ad|kzTrZ7rHOG%u-1X|X0hkX9%>^dfo4Fp(n`XCC7cH+y$VV0wxhexQm(4#kt(<5! 
z`w_@Fv|Rm~4A7fyb0>Rd&}d*kN9A@CM_19FR0JZ__~RbusX*87 zajl$5F5f#88v`S{kensv=6fX-mu5-5ZAt$4now5p!yR(IQub+^0bB2&Rf7Rv)V#d@ z=xQN&F~%De82jWFm%qf@K<`-AXcJo@-VwqeNEc14%)q)x1})%8CI9=@<(r}7DCQ(8 zmqGeaK2BWX9hOi1FceOgPC3k3j2-ttTI`M&tOhp9J2`1Lg0Udr&5~%l$tumD7{6?l z93v$N-pLLX{hhyJ7@Y+;MJ07MyjBQ;;S%5J#8@N_P8v9Un1vu>!QNggO8Cs<`@J2p zk>;vDQN-;BHnrJt9Zj%>;Kh;ro-$XR$UL^$k)G_sK!&%_S|ole261D&8}p($Ah)!R zt&>y&uMf`!w-@&YFpbrKNkchXpK1{e>#6qw)XW(|@k#K{=+3wBSp2Z}17sU1{S zpSi)cW((0Qd!}REo9MJiWwF; zbe8Bj9b}sXQn`tbTGP*|96BhNdj+j29Zs0u$}Jc^)TPNw&Rlr7*-p=w;zKbueb>}Q zPi9j^uy(=S7jZnVTmJXzn@xYV{y@>wvt8-0%fv1Oh)3}FqlJ`wkkO*{^Al;Y$|@zF z(55URxH<&oc+E5ntirQKqpw6*?krXOl8iMzb;$_#?K4b2tU^GfsvVkB&)!4I`UQis zB^0K*C0eFEj^)XfiYRvL-CSIqkVh3S;I;k7;B#TIKwRfn3x|pcX@Wwl4vAroXI<+O zk!ng-sM*iok@t-j5=Z=dgCgXT?k<@F@odc;a!ded7$85n;R7`g;GR#Wql5xhP2Xk@5-oJ(ZNTg5)14_;9%NXH4NvyEba$v<2WG3pZy zXOqVpDR#IZ{~_uQva5?iV@?>MO&nEjq7m*CpQ{c*4xUPyy<JB3!N>e$ zbFVQ4a|Dqj^tH>bg^0*Tlgus(|7?B@~OHs4fzyyV2d zo$8HClB5|wx<0cjv;FHccn&~usm8ehGr79~!76zZSr>Ybi*(@9it`~-IT2R~q5Dfn z3S#*4#k#GT<+AxY=T>xW9-}6bZ+kMGSgyhLqW##Qv8FJnGLeK~Z78d!)(t~+bwcmiqAy#8%lD#|6%%9e9)0SO+zq26xl=21?!Dz9s*rgTbQCRj%N=!YZHU6-QdVNQ zT%}N2H0u>dBOq81@Qr^3K=)MUR4Up%uaPQHy^I08mi9YfjDCu%+kLAnSiezgeT+*n z)_ZCl&EVqZ;+Va3xFMLR#M~5)s?-hYa+$YjhPe(IXXqIlTsy|H#8rCn1CZ{MW1pd{ zzwl(vc{SXzULh{4!!PB%T8&K0U((%pbZbkYn|If@d=J_9f_XgAPkbjSvXvhf~LOdJvd^5d^ zv<0)+f0yIvY*rOZk$_S1cG4HENjP}Ol45k{7=Ygo5))fz0FCf?9&;YQ0^82-M=s8)G=bRF`okT zaOm;EN~zv&4riUy`zCuTgd>IYcowWqvq)F!oYmp(r{s~7bT(T-9tH=zb^M`jSY^aK4wNxWA$aEkJO~ zHUe|pc@@G%uH~A&P{KKr+a=4D!K>fMPZSO6gFnptbVtEmAG>H19W91}2N_&F_TEdG z5Tg#Iv?81RvDu%t` z+=Q3SSkR)iV<1KbUlzJJJ_{N+(qibOjL9mRt4l)PnN{XCWk8qOeOj%jgR-tE(tv_X zBL@^>A_s5Yerm>sR^P^1sADZ-L)j!u!g`)QdfAVTuOgJU%F`MQw_7d&>uo`{u0i7T z#J@IZA*;0$(a*rS7-U9{x`GDjd3-P^8coFALRz^Snud_Nwaj_$q&F}`nIU(cK=ey= zv8YdKI-Ym85IWR@_!qZGMA%(t5$q6iu45UnzE?bf?OdjQJl{FQ9gy;U5zA)GMSw1q zMp|x7fw?(Psz~QF`OoEjR!xVBhAT&@lwu7+#gWp&&-V$vnm714|Ao@^pQ+ks)CBnW 
zsS-@-%?;ho7n-T*55h9<#2}6cc^>!rQ3?*lo8c%x+rv=F2IR-+ze5`ldyes)6zaL2Yx9mZ=G;_r!tMRAwub490@ z)w%tL-FcU{xZ5=QaOho)s`de;q^6oLR<38zMs<6Hmg}C5idB$ip=S8Sv#%L>p5hZu z(N?kiKf0DV36h7~?3ma}bnPo9;%-Uw3st7E+c+!=Qo1 zT8%a5d%4^eke4BAY*11FXh+YLbW>e_%~$_7o0oMI7^4u4+B}}ZGxflfJ937n$l7y? zya(UaxfE9@PqS*m<}Radhb9RX&uvsIbQVkl69G* z73*sk1N^rX)lj>3XmU)BZUVKvUh=6R1Sl;@SAKDRh#%LhcD9E%WJ2rtIbPR-bJ7pX z#L~jp%{Az`M+ni4q!!Ph4^@0K*H7aBp2a#V(BR(nwZV8D%g0o@n_Ewi`6i`3`JZuC z4DZu?N&4bqu66UvQne%DOcd*TW#LuB>DzXZz{i!A1UrC3GN?1n>WZxYo3j&!EPb1C9wvvF&X#Gkq539D%q zQ@Lhli@?HxPN2<`i!B}K#VPc}T$i7C#nN`_&|OE8t^@WYmWQ0X?X5UFT$kKYuOfuP z4p(*Xbc-+(Y9-nimxlPgm=otY=E&N^q|X_sb`L?qG|`%X+qP}nwr#un zYumPM+qP}nwrz9%&7GLV-OQ#Ua+iyW%s8js^V9+Ql*1NBa+a9%_`Z2X3n`{Lnrc8) z_o=uLl4%Mb-k>e;%4jMA4Kd%|z*;nibiqDH(n;mhS&97jSvEsBZn9BNtLOGaZfu_T zxK>3dFLo(mIasLQ5`|KaGA#Z&nu9-$>BJY2Q0&?F=brxf8uu&b=@%clY&bfR(kLxw zi3)BVccL(F7N!$E!|Xp8KJlBi^pN}|O+>+bZYHbeu8!P~ zMf;MRCV1UnfzO}9V9?yB*zYFCy7;zD;ZN@>AZs#x4^g5PB(ong5cX;STXZcjN>9WR z%i2;R(ZlO>=_rLWNmpKPN|El1$Ma>Fmg#;S->g(iYFbbhe#lCDAkUL6vATyb{}{+T zil!bt^p5E*3aP(mAj~U+66nmJiLC323xPL=#4(YH@6m7FnhHj?Q=p6jR9Y>8s;Zl_ z*cptgCwgPJo;ix4doMItXGNNLPC{PVob7dzCh{#jbBJ!2UV5#?>)pMK(>ZI9GNO9w zmzb0UsclE1(T3%v_Ue`s-H)Hb`6y;TA`F9NMMlavTp>&!OxDXRaXG z>%8mG5sdc{S6ZEFE&Clr?pl*yQ5QIpd7>h!Ol1r`36NEPNw&Z0S`-Tl*m_KVY1>Z< z)wczes=@+B`yrPJKF!d^bp~~y*=c3@rTGg$H^?`U+e`GItIPwZtAmbYt`!`6bDWFf zx1E`lFn!OF6w`>faeaa~{7rntFmh2@v55-U_cOq*|B(jw0?%6vmd_04fP~(_PAD0 zY?pfCjSb@PbkVQK)%#&GV>x8BK_Vy~$ z)5|x9#(hDWD+EX;FC9ZrYl$(WiMB7-5eHsr6C8uXV@GfIEWj`WG`*P>7LK zIR*KvY7o-ka#LM;&PjSb8@DZRX-b-k*ZZ71q>R{OpTB~73B`&9EZ3inKrg7Yf2&Vt zJT&yp`r-6E1w9;FggwlnqpyW&JFJH}`cuP%264ti=@0?iMjv3x$79kz)|~d9aVpo+ zJoaX%5528b6+}~6fO*R@oCaZp!-E|sv>WJX2Xi|YQ$aBRs27WNEy55>&{b+ zPa!9(P@nA0@rY!0%gD*nR1(0RXLtG$-h>5d-3zCMA^?ng(JC^{7sz6j$vNc-`XDaf z|R@Y^IX3=||I-IiZ{-6z!bFVK$0h^W8-Vt}Nu)VM@G(9kMa>4N-DqXmn zkXRf{+%E}GlNd=Af*9HB9L8^gics}x$1?rjgy<*m$t-&YBm67^xvR0E&Bnu(VmCE( zf|U6$V~RI%TVSjD`yZ}_B<3Ea-G`fZD@={V?Ma%cCmh?cPS%pV3d2v|Uhec&L(l#z 
z|Fw7L2Q98sLqZ1YkSlQo_M3Ip;Y~uIvLfqu1RucKlzoq3a47$b^v~pow^w3KF0ym% zQS{w;Gs#I-_A!G)T7mnqd@nHTAXP-IEeQVXRHpwWpOFsa+{61cO9t%!BF+e@Dc{rJ z>6`|7!c8-__bg{{TnU>BZlnY7uen2Rk46D;6`4S92_)|-2@4rTw1xgaN9}|Pp#U`J zsGrov?cC_v+g@Gzlp_n)aJP!quYGt<5%Yhl8)AyuKpsHl5x^kaI+*sv@G;=18w%p@ zO9viZ1K??%*Aa31<{R%#V;Y^>xpGp7x>k+eqP$MliCu*$NOtZaQg70DQ^9xeinr7+ z-I7v^OAf*>SutvGk);$AGCfNZT0wbRMa14-($z9Qeo3RFT+@Jg}k8oa?*V`AS68_!b9_(YtRub5~9#FF|=tgTdva9M42a$ zY(m(heRrgH=wFFZ>f~o1Zy%P1%T*WZD`94yC6q+o$7!k|IWbqo#7*)P6)XkGVh#ecJWb$(=-cH|NJWBmgb(56j6LJnS97WVBH{UT`Z4ob_opvtFoQHZBpo8Xt8h1et_Iu`}WnJLZ-|atIxqq zllEQc3KYMV44$ zFFvI?bwmMq$(7P9BNN2$3O_1qVCB33BxGKi1eQrLq{Id(pTVGh+0a7P&i+ft2VgG( zQa0WyHw)hCO%aqCGCjMujAl@Sv}Eai0Y583x8UBIqvWOvqYj)0|BC4N#OZWQD<$41 zk6=n~|M~PFug}0=xp}O_!_+64Z&vgAu3}`f-yIes5%0#}|Vvyx>+O@uw{_8K^sL zGveO3lw%kt?`r74>s58oLrI~13IB&Wo5)Tgli-OgTa;mi4W#$bqYVx-q(y3>dn@9fi`FI+cb+4B4REpD5g6lx_ZG6jaLMC8};@8w3;V= z%laNshTxmol2h(A@qwTucPsWFPp7-(jp>ucJgZ#!m1CBbh!2ML==f+0y%nQxS8*0V zybERfU|psws|)3%(K}Ic6nuE-r4KM;<*_Du^OUu~#Fo+gpw#K<7zR{VFM6xiR2}v*A&Y*_6?vUTb2YORa|x-UsKE97BH#- z+dC(R21B>qFtcJ%AeRJapRp+R?;kf<*YPiA4J!&WdSW~t^C!D3)mc<%6sDsnIO>}o z^c>^S)CC0*snzCr@yJxmF>g840EU*sh$x&t-r~7QndRWvtjai~= z>KjElxXHcRceue;bUuM&baJIIgjwZal|X5S>?n2pwct(H@J|XqdyC~DB#ZOh1uPssYW-1% zZ!I%=omi^GFUH1roM_q#>f1bW$GDT{IRHok`-nvMW`Sq@!BcQ1&O7t2L~9AYz_r*q zZ@|Jf8Y1vkYgA#V4NhoH0u5DMZzFvy3kE|Y8?ceSC02=_5hHUA&onDbW98wTXk zToXh(F>G@_lLLdd+Brhx6t|WrCmno|kzH9q8WBGtW6;0}M-#m%G;?!{(@kT9*ysVa z?J%=80n$7DI&mm+di58HHEuf{?TI*!844T_aFij6cE*Q4i1D|vSsJ!n3NoTkYu9@b z#RCc|5~lOGXPtZ@?*0J+aK6+XsdGCZB7^T=#Jpt_8T{6@F5Tu`oC+{ua{>o#zh9Mh zQHgoSzs$RRqrQ*m(F_SNqLx!Ye(v;7*&#FzmDCvp)=1`OCZXMvTfEUQHs z&I4iE-VoC@o=WY(aD7-tkaMTMx!a=;reF=6HHgy0`8STMk-7#>I&xP$$4!#s0f-^U z3#e$DDx2?=k!+rdXc1-sZ?&3#0=OHc1SG$Xw-|nUg)hh5*_9&K=Rcv7HsX)b1hXyQ z9=Dx~ANLXXs>}f+M@r3W!vkg!;Cn#DloAh5U9@eECq7sEFvlOM;ydiV?>)a1?&_JD zFK;~v?nCzeLRGc!M%pGTjGO~T` za3Ds!pyvr8B)pF$Yh-|d42^GBYr6 zMgiEp3Nif^dXkt%MA#|6tX{8<4!5>9NsWpPSn(!=E-fv&f|jc)XP{n5-gc;rPqWw5 
zaU`Y$jUOV@;IBvADHJ<((-liy{(im)8=Sk~T2=7;gaVW*u~Wd0J>riN!cdj5cphQ| zNZMK6+$CCG#$P#2f~2_wu=??ex9IQRoF?3l|M%@n{7xLlzPTGT7KpZKsH6<_nv2?4 z>lIkQAp96N3Ea)M#+N75ZyHMEy3823HT2JULzO+|n9c%mQdm9^NsDgUX%c8Vr13bP zbC!n+V;v%KQQ(ZFGI+t<1^8c%lFPUE;Y|D^o`JA9nA=0526+A13W23$s6R0y-9!_X{l15s&o~>&PcWr5Q z0k0@kU+98hoM-d_C*c%$lbt038`*eAF??s+>gN^%jHcqW$$vDbC@xEaSwx+Itp}jt za*t>5SJ@5k>;*kNtuav4THU>St6Q7-nvvzYSOR;>p$B-V;shGv7wW_RRrWcskT#)I z4KEZm{Q?N>2KUf|y|&d3QznyTzf!E+j^{iC?sk4gk0Ia3dH@m{ z>3{}X8%V2huomsw1>v)+10W)jdNR$494@CF7pYPf1oM0?P^ zl@#aPI?3p|U%7~da}%5jkt6&pHd+m!D&w2R9CO*c^e05S&)e^K|b;}y}+x+4OPh*}Rpy?^c2`ZrWso>8|zuO^2#TE79 zOk5Dvp34`6zI6#Sp;Oi?U3&rbr}gLckqr!}jXt36U;#tj6abL^+`D+~M0-mTg5OOq zRlexyjqK`0hP*bst+>AC%n-tk!19chs-&v-F-U);@B?YVi z8jh~LlHK(s=>sYXjf6L+gGb%MZY>A!FoDuXn5N**-Qg3UA+goWmHb#fz}nf?E#ZS^ z;`dEi)(R9HPw_ZSd7bGFnrbfn8_s_?MecDbkHD*jgV{kq|7D}P^R^o^@RhndC!HsZ zI9i!5dYsW}Z+xMGUoSFnd78q2otYCbH?x}{t?QfAFdQ>%mp(7T`oXOjHgZFvAN7uO z{sF2t`HM7Ego_Bj$_cPnV>VkND2U?C)+=wvHvFeoImTcrqEs}gu~cUyD_g)&iZ0Ob z(K7HB%{Wvj*jM}t*Gw*2%78gQ`qJ6w>9F6!W4v6nTGSX|6up)xQfO_zKme5Gv4^Ll zH|ivjnCsRt=5_jflGf&zqmXH7{Ia^1fZt%x0w@a7;{b+!SzI>K3u3;znr2W4zXW2wSK7 zUZu94#%E)^inpkInO-Fy0BE{<;@SZydHLybie$=G7_C)0t6x`wn_$rqXTu;~w4M^k{X2t7N>M zmO(j!=J7bAE*4PvuTc3JghPasOJ+`))kQ)J6ti>>QbfMErBM%SsAGnjYZ;x6s)f4b1RK-%osbNlamk+vD2LxFob8keCx#DK%mbmoE-^TNI!$hW(i-;$r}!60Sf5#ZBxv%9C+tXlw5_U0+PX`kbFVA5?w;_DY z71?i;NV~HYJ~E-j6z0k$jl>G9S|w#Jc#yepcsNz%|AM2k{4Y2v6AK&5{}8DE;i#NU ztgQdD{C{y&W`_R{9M#iAC2Omd?N+J!KNQuCQfqsMRKT9R9o-EI#f?I(SsE_w2e6&SOESTx}2*cKy%vP7J+qqVR9647{DRH+40r2rJ)h*FLq#L z@R57b&JI$|pJ#r1c4BE}77W0-8DPALssSiDhlfxI8=xnNxF5zCwz&<2LlcNauoHk5 zmeK{*Ke{tAwGs@NnxsCotEnmJ>d!u5d44(35Ud+jeKi@#zc~YtNee5pZ;w?l7U5TI z8o)$0{MQXH{J_qQlv36~mRFS$u654}z#f1{pw^bHZ|N7ET?>M_AKwx|U0R(ytPcs` zyk@|_OpuLDUS3{}+Kru|Dya&!s_94no2ABP8z5J%))ioU3Dp?>vx?z17A@W}utv}i z{Cx-2zc7Y#d;#`(OLFtVt^~L^8G$?;=iAOGUdZ2Xtod&XXCDmcM;yy4e@nkNva&J~ z06ag27~>einI#M(Fu=f#-Ty+LX&in$@YXJnz`wYw>goQAZTmwTqKT$)2r&D8wv 
z3jX=~E*21XXUF?jG3!sKwIiu0D+YW0<}RGTza)nDoZYK!kGIL;Te_5_mWG_Rq+p^D z0+UA^jg(x%1xIs0Z@;hFZw%S5iUMNa>E~{s)yetM`T2(*C6z4}Ew#_e?%;O1^u+w=5H8W{i~ke>aS}fR6#~Wp++PQP zfZoZfbr13{Gi}c(cF#zH@Y^RRCmS$Tz>7V;FNPH2QQ*M^*wY`tK;13A?~Wh!FG7Li z9RQ8()L^_^xD|R*f8hj&HU|L6FN#p+x8yUAF#pf0ib>$LA|$D(hrJ)jIAXB$8YDCt zc<1IOB>;}#^G|i@j}3%teckJ??7}bO+~2pUMBUZV1z__pZFW?4a}?UR zz!I3P?YFDhZ|bC9VEj9ZTWceTz~)1`_{B?lL)aAE@pt{3uNQy@XS@5aBq+M1BZqi= z2DbmF4I4r!{Lcuc;@7ShfQ=Q`hVr9L=FiLew=v0)#jUa4q4~d2*?S=W+FAiS6xVT@ z{}uq=oCHBOfRCS68UWVSv2q6C(ES^@*YNeuj-nqs3VL_|S?_iWd*SH)WDnpEfa@fG z1ik=UTl^9L{bVn8!((s;Nxwnu0M?Fv2qdM4_#hD#&)|ZE<=@y3!iD9(g4qCUFZ>Wl z%13(=07#(k;DYHHKiCfT!hxBBV8D&!zczq~8GqOh0@pJ+*Mv16X~9`vZNmcx@-lF) z5SUSZ0&oD>=lBRhvybq{!NYCMA4&Uds}nmf_^$|`YJdLm-mCvT$kxAsh1AU6;2{dE z{@@`dPVT`S1>Ao%A>fBs5Sa}A=7qz#e>5qEH!k>>5h`c>mg5a(1;#P0;Tyk|d2M6= z5`KH~{%Av4Rxyt)kDlY09E2ac+PMX1RhYm2!UqOs{=}d19DMEK`c(f&&-_&`@L7-j zRVHrcqzI+hvFrTG2^D`io&-38YIFttQYV2mvDcR_@Zk_@Z~y!e0`dOapp?gN(GbF& zAIVX0@Y^r+?^&(tkB`%j3rW-8#^|3@$zKR-f32J=0WLQ5_*jNtItAcBEt?+Lb^KE@ zFi}|d?@QK#A4&d)6YBSnw$R|jR9Z+@6mI_@CBVkdc6^B-*VNe`e($N@UaeowN&M5P z{`=bs7(hU`z$wKOv)rjrL92!S;1qCiBpKDd1&Br*m44LuWM5d;T(&tenR`~BAS4nk zoGsjBA+4)YGAawyJHoX_p)_)&*`zo~svO4ziTp@gn5`YLz(6Gn6yOekFrgXoK zy;@$M-6gkqY%z~rflixjcF{;VLZ@l2YvT}1Iw(Ib!zA$pBpFe7tl-pgbxq=4!zND? 
zn5YsF0ZZz5h^jd+JrAlLbS<%dxcG3Tc!BYZSQd$@lnkW{=M=3DLCfvXP0~7=kCL%J z8~Yga&u>-fsDIFyU=g>@QnW?n9_I1QOGxSZN?5A*A^nkSdqSvwMb7hZO)c3325Jh=^sfU}|T-kyiqpG5!oqq>eAaNhdGOpCp~HHqtpa&NGq3Km3` zC`|cgS^{p0Iuw0$qrGhk2^)u{h~-097CRpj2HH(S?egHw8B(xO)vt`2&x-KN5^K?p zbDtXV{~H5xza82pK{^^s$%hbxK%;8bpr2H0kz9VKYu2{%Y}z?F{1|q+ol|{2FlsEO z-Xx`i=d~2s-6|_|Cu$~hGGc;Cow*kN99IejWyN(_rq{`@bc>9-qIhq42xNMm=58+^ zPvwW-gT?NrdZESKaN|fc?iQ=*qH;>x%*^0|GPqOe>zRlK=Aa_*UVI?f-hg;sLGfivwJ;^*uLAG>!E1#L-B}*yf6NdSMjC<##Lzo5Gz}^q-7KiT-H47OIYq zF8WfRdu39M<`7JH+Ra3JRJeV-|bhZWbVkn}lU2VZE z0Pa(iK6xVq8(zr=zcE%Ej!hZKz70N5X3k^{ zJT-Gx`Rt?uvNNNP?@~go?W&LJHX#yP7%Hd8X0VAz2`*Lbh<4x4nXn++A_t(wt#d4{ zEng>g9!0D9GVB>FgTBNE5obBePyqt1g%h);ZP>LH_-j!zpZ3da-rXfEd=bj^J(QTv zpL$ghd%;&B$QTu$ail=Vt3Lf=dPhRReB?*p^u*{1vh|Kl<*QzQRBu;` zQOa-xvHf7p4k3N7wfEM7(U9B!`kn{sy;)kG{$MkLIKmn+c6hgdO?g<2CiY3tPDr>O z_R~xMo@pIjw;s=@Iv+gss@|&1d9x^NX6R6mU*p>5qm(Rkyg|GJg&->!+l_qgu5NXC zR!0-h=ZP_31+ei)NdD{m1KZeEOL^1v)iRQ!&mDgd^qA&&)vqa&wfe0%G!%c#(kHQl z3_zz0eaXhzvhaoS?K4;E`_sE`hD9#<+6F}}{?|k@##tIU73tX-it4(M)mS7jVeTpH z3F!9CZH{r6wMF%xhB4!v+tX#s030~uyCxe^@M$G2Y~V{FIz@+4zr zaVwN3?CZ{f_MV$my+IKfJCzknXSk)$RkY@7HalarplX7~rIc7(U7_FQ_>W2GAB-Qw zcEF)zgPU*J)1 z!~CBP(YNEL7tOtp>(B@8oQX%kIv}xajfvN8z)S@?aN7|CXsxc{f;;MEm|Av)LoLOH zN&dd9TK(Ygnn%@kdLZ}{489D}`el?^&$2$(A698}DGtKKWeHO@vQ?Yo5R1F7%VukWSVta9T1;&d8EM`H*$8ZY1wb3}H_eA!U&i`C(af`~e>)Bq(r8hJs^tWhVdTwLttvuI?be3-? 
z_|&1kEE`Mv*z9i$V@&*!U9gcJfamz2)Mg{FU7G_kv}GD#%LC5+1oiRhbNQ!!_vk^7 znEfxFD;cbK1#pm8$e3P{@vg3N8AbjX%}8qbJX+>H32v2c_=$j+JlxxQNn21W*e6P< zCjRgz*(`SgQ6w-%SKJy2^3ax-7Qk_?9DG`2E4T-Zp<8#E@oi-?2n`B}Jq_vcxJRpEH^IMf+Z48_&r>S-w~13# zz=r_G;l}E|{hixe1Kz=&e>z14?+LzMbib6-|M8h$%n^`-q@$yn8Z|pj`&hEQWZ0hh ztPdMK0E3`twRQo|*3}#6t+9pP74cJdN)6CQ0iZtg3v1cuw+uKjm+%p1$t^qI@8;xq zjQD-cb262NJnK11$Z$x?y5yx8s&J$^BtUmwldD<0H#WJ<0D!cu)vw7GVF#OA9e1ah zXQy7To#eu90Y z=L88?G`llYUkyjv2G{2(^lc-B_ZYDeFovC0Z*X zppwzpB)IAZ3)4h2n->CyS(_qL8PH>0{*pIXPPT`v^=}sE^{XU-;zA|(4i@0LH8trO za^b3uaA445JdV*03G1{Cn3Zm#l@hdbA#K+=(CVWA;C)hnHe|7o67g)woHv6~Lu;Seex5Fk=irB>u^6#HM&=2*6X;tac0aR+EV;Ff^ z0uF14K_$0O;DfblkDI|j^43YGalhH)udRTS62<){_zNC<3)2E2A!8Onud&2bv>-{D zXS`qON~?Ka|5BCQC-*KDISZ?Ka(ul-;4t-?L<=;g4J_cvgi;@Ggx-o;^xBRGtqk&T z1k4cIW8&T`NM8%$$vj2=W3o)VJHN)tFmprt&AwgLU*Z|f4|I;L2P~FW2b9ECK6&)a zCh8i%DhlRBiv}#

fQ1-U>@cvw$D#CbO1FBJFT5TDe{F?|f%%4vMyUT%d^iOk>n8 zuv0!twN~fW5(3sDo)g|KunIDwci8^O2;CS&_~^K!<_M*Uuig!-n2#3abGt~yv*4Os zlA#gTN|4C89cUQf6dBY`%hs5Ejbp?If13!>y|j~irFj5e#LITw)EDZk_n<5%Ij@Xg z9*ErWV?c4E;!aZDVKhk4$g}2sq7AX}mf%_GcH}CiG}2|tC_8ow#7|1tJvR;luzweUItG+Ej{H00CA#ft#aV`rcV`QqS;X-SjS6Pl zk-E>eno+AVU!76*69R>My2zmt&;K%_{3Wxunb(_&%>m}fs07QWh*A-`%+@hUrIFMY2)vo{DJSsO@ z&LlC*dl`ffU4iz!TbO=R2(ILh-<1lojAKd)Fa%B72I+6ii8~SK_3m-kE)_X80 zP5K6RW^_x{O_rSZ*Q%>zvK3+}Q;?`hMBWO-8dlkzae^B|;xl9t0A;oM8C4h=dnq>v zVC%@X|3|3#ZF4-4cpOIB5a$!{1}{VE3U8%vu)q|O{gigaY^H(-zbM29oZ@KHhZjw6<-P?^LpAdUe{{ zsLJ`rnJ}AW)ciAM2%J$MP9V7472$t#)!#^(D||dZDL2-y)K|25R-TP<+&Ig|^uR{N zi5+#0uz2Sg5IQX*1!LR_7zkY@7Uf5Caq0*|X)44lf&ZX2=293r2&&|b{!u4p;Ln)H z@MbbFSA3-xx72meq{^(l;2zCts<&)--Wn~e^V4GCLAz9QXfjk4|rz>66&Eb~C^+e}$id|cG%`-k3d=w{HoVY_X5^g}M7%ywa97oKYcf zw!4sM5$F2MUL|YYi7%xUSC{>n`*vh_?Rn)jk$Cxu!*R@fg3lvC)p$L2EWS&qTYu7` zbEpr#0J=29=RehuRT23)L=#Q>Rv0E3L^gU`*)Fp+7^h5= zR|w}EU;p93t}t^iGq<^=+WkO1=&3}>*8Md?A$Bsf<_1TgzDH2S?rNf#7NfJzhJEPC zXcc-O$d?3@tAA~4l{8C~2|tBzkij%)t7oTx?xT)SP%Ix`gQM-5C+=uSl7oUieq;g_ zxAj}Hv+kVekmlmAb62YFZK;EXD#xS`h2B(fCnd}!JDmP+A)E@E)a8)AA6LOp{6%>X zC@gYvBG^j&op+ZOh?N`OEqPIMEnrBf9fxS^C9y{>xW92{ib_sr_QL+0&0U^>4ZwC8 z&Tbq-;jN>s5wfpiTOEe7$UX}Z>RIW@$u}1 zH~%(rt1Ki*2%as&zFdnOC$M|mDCXk) znaua#v7ZB@x1R=mt78~s5i9RA02rT4i`;rK%WZHkCM6|)sF<%Q**B>xTP$S5Vq&=r zx(s)oDvCTXB4I;bg{W~)71F~U1fRYL&}l=g>y(Y`qoAUxEC2&Bha#xv7t?yUI{)8= zEmRAuv)s+Ds(!N*ZX2RZZ1Z$LV*%{5<$u!b^?5K@TMY8oL#7Dor+SJ zd)$t=ZI_nrHnU9h|3tnRdMs5WxtB9O+R=1Jj$fV4IbFK^eaYu;kAo*3(C;zk?SfCt zk!V#n)3VuGA=r{kVX>fR9zYjr6chRK#BCd4KSt}?^>sFiDO|l)w)n)EteWpeUZbeQu<`XE(LW6Q z^zcoH+tXLkdhv~6vN}e96Aw%O+>;58^-5Uf*)T{bQ||C_1eM4po%jsPTPXd?(VS_O zegkTL4>TLjS@UL}TJnGgq?^F72z?)H_Gd~wYVczMT3xo>aF8G?dsb~W8U@k7UshiY zaLZ5R80a$d&CSAqmS%4rIs*iKZ;D?uTk~$m+T_m!QwOa%pkQOjMe8?;3K^o5uDm~klQisbNO^n)?hE`hZP1wy(gP3v`ooN(2J7vZV46*V|qxH#{TZ1L&uT6XR zy_Qt|JNrE;V=v*%YN3#rmxrnq(uB(aerM>rRB=|(b=x;(fwquCg-w|45>C85;Bq&X zD{=5NixUzh4J&Tb%@;-wbS|2?|NAKuRVGfVswCj++^i1~xlQdCc6dvDo5L!H2h`=g 
zr^5}ZJpE^`q4vghc8sXp&3e@|WTmbI95jrvt2>_-uF7#CzCX5V+D-p=keWP-HD2?c?E;M4pp-JmI?X z;3QOZaS|jYFX3Y@z23eV=mPQO>7vk-n4WN`{#sbXQC6ej-_7SL&&EYJ(C&6J+WPqBP z8ww7c$hny+1yr}57>f4w`MM#Idn4g;8Hd6ZQLhU}C-Dh;)X%~!f6u~I4c3`fTC>8h zuQQo9;QEW3-YJJn!^;zGHXyeqiJ4^$zA;n294>Rt3C9Fboc1TuZbsR$N@0!r=-H-QDoXbPL=GXxp z`oTNU3zvPYpF<)=mo#M6Vp(FV*_kr4yN{6=1;M*Pz%uXa_HVH96lb}sJ{GwbuHvbP)3pwPOF&MzCg$^meY)_MdVh z5w8A0Bac)vn9FUEDzqKT<+XMOE1+%Uk=J$&RIPOrhly&)TS?g zD${9p(j{vl&m+=%sFM1@OETp9C-8;sSrC({lK-pMi1`Y77J1sDmvA9?qCWg%c|(PW z`|l2>uWU#v7B)=2_RS<-!t!)aL2Yb^l6SBWQ(ix+6c`tUJ&StKG~QYn4%HOa=?ewx?Ib9vS&rDWQcw z&@UBx5oTM|RF5PgAG-x{J+IbgAnNmiNF~IZM%1aBGgx_W zT6_`v1Zx$t{tfMysux_`aW!;|SoiQN1GPgi2`Ty{a1RWr@3V9oc$s^HaU~T)A<0Z9 z3jXxKa-%Fh%u zwDF-Im5{!W?rnSArjV?2I>K|j-bpj~-?6V5h{Nmr&1$Et$@{xH=;&!F!jfxt54l(4 zY#5OaZ3v&&)oDgjq(9QTs_^nr5g;ZTYh>^S6wc@|3gsfa8ZT$Dh{C0EB5S|g=D>3kwmtX?DQCF^V2 zqS)(_+OUN>I5Fx$1z{`L*jx}4y_(}3W$rDfb|0sPkb;&EYAWxoeCTTDx2px51$SMlsLF@p$yXXXCa|ay~G8`0|YaG!oG^GQiVy6Y+X6aEf#iK!r zHBsrm;6pr;osjsjbM(tE$nC#i7c2yGjtP2G4B(G4D2p2G({ZI(X@hVhEv9^d*@gviw8UjECox8U*YK_ z&zia~%5S(G+b4z`W#?pGus{>dj8DcgsctU*AXJFS!r&OOkZ7Y)8FLk|Hq@=XJrxz| zXE7^>!-R<#eC-g&tLk~Win(fA)0%smkROGY0RF{%!G6SyFlH}2kU{f5#Eo17V-Anj z*OV0Z*dj5JcjjEt;wKsX7jkRmfSh*iAqX{*dxF|2bpRR|0eoGO;?t|mCs3!*O)=oY+k_dO!`UtzkBLfj2fXjEA zVBJt{2fJye9+Pq+LxQ*QJFN2QRXz1tU%X@9>x@5Jr)lt!`Qp6=8ofiQK!6X~!xII8 zQzj5G&aE=KO{;SJdnPh3T?9^Us;IYoN@%t-Zq9eLStf+t?CISSM89x{;P1Z77}iWl zjA6}cpI=$A+>uS7iTwlx>R8~}Gbda$3W#^~xk zb%|dD9R#$+rzjV@%>3hXB)vK}7(g^Ilxg^WB5_w2@>*>(uNNZSy38Y?g8C8&1maIj z;ad9njHA>7D`%-i!+6<*CI=GrnwOB>%}1QF26(EFQtS?qq*`Y&n8+VBsU|23i@&JP zi429_9YN8{+rl)#=@0&DHZl!|X2As6(c3!;eCpY_MBfgH1?l2qZWHCIQH+P)izmGE z`t=`}16yo9Nt1=NgiGMt#}lg7Ok0TicM>mtJoLbJlixU}?bM&|6)MKeKWmV+bB{sg zM7N@$i!c+FTnaC3rrx*JR_PioXVWwWmt|0t>%IhWBd9A+mC}v*n(hznO0TUyjnG3- zG`0(Pt%Zn{)dG%&NVC%>e-iRkJw$9Vi_Fpd+$4O}py$bueK#>1H1v=W^#I$*>JS6l z&H$I`$0I^+C%a{irs3yDKhV>-P;*sfln~G`-6ZwT;DM8A3;L%kVT~y^Sg!t)Oh@b{ z+`+@xFw)2%FG1`d;9!e{oTdi_pjQK(Vo$KHF;Gg-EKydkrO$pKA+wz;;*iUPPHEFg 
z?(zV0Nj|VmlT<5)=;8z&^wTriZU7iO7~P8W_{kHe2iQ-r3`=GSE$*GB|SR+<^_=>VXxu=})U3 zyUQRx)~t-Nz8(ZFOjnUVk-SvqWVc>z{a(BjRbSPvo5O!^V4e5AblsG*mMYS;F}&1R zq~nD%nM&_RHAV8KQ)|AhY4D5sn=^|_AHiR4T>N$zqu731EpvhlwYh%F^H9BI<0)6Y zpN#e>EPZ$mZMD)*Q(lY1(oWpjj>W3)+UXLNUFFAVnl}tZk+ZOC@9XvBV1Q9r2oUUo z&ig-R(jmlyw9r9(b!{qnjlNQ;G=(iPy;NVGm=!boCgW2A5mN}49>H%nr&^O54Z07< zWs@dw(kRlKK-i&zxR4XAuJ`tkWC8yNI6%k0PG0#Ix*vsy#mqgXJ9(zPBmZIs_AlZ1 zFW2S-u@kVa(^sgADt(Q6Xl$JWH^#$q2880fdZj}Fq|fV&I93JZe$(R#6GiVg?$LA{ zoB;=CEPUtInls#|Na2O3mgTARa{SD&o(Wb@M?H%Qok(dz3W?dlOJ7R2U|+xZfE&}& z%RU9fAx<4uWQ5E;r>}k#7%mm9?41yoSYZf|C`eLM_&W3CiN)waI!H|Iyxt=uIt>Fx z^>q7akK{CeH`Y~|K0&tq>2SauU$|#<>-(&Tr?}e>yzr<(L(E@ib7HV;AExqX{7jyo z&NopJo%otgNE2&?^8!5@l=o={DzV)O5#B192c+Gm8(34deNaUya3Op_*uy_01g+~B|se2(v6Q3Z;|-W z@s{|>L(D5{hha|A-6gu5WBVTG%iHEq3!|Dm<+@nmyl~>H`r`HT0OLGDB*Ki{h$gh zDY}N9te$X`ISR#}oZc6>Zc=NYvEI_&@6p)-ZtY)AnipR-!&?y2^=47Y98Egb_pS!| zdd`K{I}&KSX<|-h5Cn6&(Y2sWYlx*$>Sj!fW3&g7b8%OEMAe=!VrYrY!DF#UAB-bl zni!aA2qwBp02SH6W)#t`ee|_45+JpxsBWn3Fx)16#ng2agCSxwiGR{NaI0j~ZlJmd z>m5Raxbai+y_dlPhqjW78r-+c8BvyU)%UOBcn6S7T`!=ie&rGy$8jG@pD9PTnJDpk z2sBXGb_vg_^z-79!!Gd5#J%4a2?7Y3Reih&xn~)|!BrK0dE6d36VIpjY%*1?^_0Ri zU34FN5w>9(4Fx_BUaqv(zTog zCJ967=KdhsaPRn;@Af#QhVsIw>O*!gF3%Q1^I6!VkFJgr)hOZEt;meyr?PSUgf;5| zj+%ZeH>MuW8cif;5RxE>dPl>*V0(J2Id$5=DkN~dcK-M%4eM(E6W7wtFF=9QkzKDO z$X9pL4X2*5O8uouu(eX_Nd}3f1M28YZ`b9p6E)Yp{PsQSmsd&FM999 zc_V%WnHy>uSh#`cJg&o@S1wY=I6N8raGZqgNrjw%9r>x_b^Mr91eH=#yS=ibB?z4^ zMY+GSbJOfiwAncN=wV1efRj=eV)wq9qnA+N`>Vc2w_tt2a_>vJgG(hf7~|V7BeIMO zJ8%3^0AhZqAmJ$=LY>JBT&2BHalOR|~SOn?~w^Vi)Tvp~@a# zXhQ5$E`l3`yO_uw>T3!*6|e#nuG69-f(S8j^q|wJ<|SZ{pFfvHWg&*AbEqJt((>Y)H#_AA@5enS_rap1^xtS=meH&@qQ1vOtBKrU z&#px&G8!fK5L;W=JJ^9C;8E#ph?0}6U>4kQ=A!JWtA=yVqTh@Uk~OGwNl3d0GC`@6 z49Md9fg9W^v6{!-yN3!)C{ep>YG*9)crCEQKb8~qeH^8ln7=S2fIuLyDb#gUJdyIc zVYY7L6+cTYCn?hS`IT9>;lQDe0#@(2b!eV-g#8u*E%7Y>>g$=IbZ9?6thiXK8zP0C z0dIws_h_kbXhc@x+Ya7>?faKXF1_*Rx@e5X$B{$BMYf|qIcIpAjL7`)6B&ep%sie0zYEBDg+mH 
zW>|EF=j$ii({A(ba$~9SpR$tJv4t@j#ANA#_XAnU;m0--C6O`2g2kqp$^GNXA!iSL z-#bs&z1nIRXsM3gMn(2I-G>U1T~-W{DU#(jEn&^*WWK&5Dv-^A2JhmucwFY!+v^zpN-J9KsERVJ zb2SaatlLBct3An6>tK{7OYf|HvdfNq^^vLf&Ps{iX#!^TV=IRUs+as~4!z`KP$nE| zQp8z5Y4S~XJtc<;BD~%qwYHUGGMRG0iSXMKr6qcq@NoS^O$?LiBsMi>l#q;U#IdHL zK<9+)O1rG13qnkeUh6@m?jkZ%iN(=oEaB>47!TDHeUs_Y@bPtDLjpbZJ6f(W?RHk9FBGBKwDw;p?{1dCC>YU16{k))q1~V#Uu*H8Cp0Y2Fkp;f zs%W15@&zSie4*EOOyW(|Xj4z}tJELu{?tV;d>E%~(_#XE_}K+PdjdS(o1@U#nT3I5 zYt8)q#t0EK$?_nWL<5ow`XgVu28N_GiF=h@A&z-O10}wukd(QJ>Vg5{o|zR!sc&ko zFU+;^+(i7gst1R(WH$Ccab%6|gsaA;2paDti0yr|f(!LxQz)h1#E?p%_wSN?Tv4X< zS#`cN7n>9Jrz?8{1%BTvp=u&DJzYvH5B>5Q*C#CNT`_$~=BC_78AHPyspwwamWxAp zxWZc~CSsh8;#Bv%E9#W&cR;?Gb1{-{Uu9|8g zaocG*)@2GNKhjc@8njZMmb6Ld^>HdUw`~0UYp_2fTPCZ9l~C12L5Gm26f+*{hUI-A zJ^h%5QQ^dy`ti8}1r!wJ8^ol+-2`$}8x^vaMIRjds-h0JP1;Z3gG@gY_Dw)9u11!s zJV(tN|5d$%PK=TxXN6zd-QcbPJk>XKj+}ESfuur`Z+_H6{|ZQpDYddjvc2QE!Lfx4 zN|l7(*cbALW9a8!l=yXcsn56rU&MlxSY(4h6jGQ*W4Mr^&)$qH-(^eWo~z{!U%=~l z@C}#MQF6BzeWHd()(%;e4JW+7@NgzqZs5^p-PlfD>w{mr>*;6onrQkL+*iv_FC$qs zOH_J#Rj%F1c0a>T2UgzB_#*`Q+$!96Wq8wlDN)B8yrSunt^PL1@brn#L`dd{dpmZ7 zb&621*2~_VojgN)&5m<^G9z)Cr7kqo&f}9!rSLhqn$Rg4&l7%7_)6S7t23C#(cLSC zanN$y_Igh&8Vv(D>sm254Tj!nqOo7($AR3R zU%PtqmrvEtDKZ?XL-xO4u)RCd=a-4alpGTnEI&+0_^1kau6-)TIU}uQ|0&kDnQe{3 z=+{{}A`Dx3-);FtR=lk}-_OAQE2(u0k#~ud{?HsZ+Tz^R@#VlzieKDcm470Y?`S>a zyEt4OD+y`RmutP_?zJ&}D{e!wRA~yJf4F%Rvv$U!Kelsr9j#hF^6kI8w3dG|rlb_^ z3@v_weC-i)?}E143a5@g(+?m_SsNR}y|L9(66F*ifrCl^-Xh!qi(|?+Ni9oyyFKJ@ zA{v^MRqSG~z1aAF-Gv+{pbsNy8+dan2?Mg5|_>v_$(d_Y&89rHmED#{8d&|&~|E(kNXIuUJc<;3oX zwKke!7tK^CsaQZAZKbCGGFa53N8rOr$^kp^C016ssg)#$Q-_y30AT}y<`4duS+2FU z1imur)s>bHyjIU|Be5cYJ8AHN(0t}*S$p_r8h3rg=OGl2xD_}qMypQF+!F^Xe`UV)Mg)Usz0aZUXzrR`Wtab$|HYKNtmY)lJRktZE;mM@Ky5`ci{BbNI4-1jc0@$3+O#uexN z4w9eF-kyQ?>(XvvsbrWB*B$4WH#u|(kHTslM5S~I>?eua^zM8gH}v5Gq=uj4j)Egs zhuMup_4O&!eidt4nAC)a@~^oc&Q*)3POg7p*r;L1rd|4Jj>vc(THN!3t&b=(Agz&H zy3(|OR1eLV@xa!9{o=Fjp(zt5j9%3PWY0n#NON`Qdo69IT0>g>U@PUweG0`^PQMp7 
zjq;Qx=AJP92+rmuLy=+ZT9r>$|4b?&Vtm(gtmE6>unBwH=zs+rI#=q3`NF#>t2h)2 zU5$DITU6CTnC76=c}`4id)NJ(#29yYMt%?gNxf%&^lH3fy%x~ zVVuf|g8I?6`KOoI^w|3w#4e9N~IqMRCg z2r>`yz5AcgFj7CjIu*vBo_kO!4B+lf}@-H678Sc@a2(IkCFV<5{2?*B zduIzUwj72F6OIzzvUJs-i0SOS1;=fVg9@AQxS{kKGQ<|RDKw5BdzdI#=C$Lr%0pVE zSLgLf5wY%zKL4y6bv$?sL!1NefnyIwMa6mlndZo{E^d_&dBD;_n#C{2Z^`d2 znr&y)hM1riA?+ko&($Zy_w8i=E|5uUzg7>34-||)R_SHW9p7Au`Z3`)y{K`Tr7exn zSV~39cTJzgY!kwafVm`=inG>UBswx2Hk{QPm4To=aa3yb=pja>t(sWbzYS<^NEBNbF5Y| zqaytpv;^@CFYk0EF+PzEB$UdaD=F_dq{E z!V+4>!t7xj@Ag~2XiTD#>YefST1SGe)Iu8J`v}IvOu5UH4b`Q)TH^=RXI|v%nxe1% z3i5_td(YcfxGu0WY7HD6UfQ*u{dFpdFS_}n=J&)l0*dYjt+YWzF^@dJ|$)kkdCV|f;iZ!OZ_W*eJ!rB{( zbN^tyQt`B=$Dvp&olLr%4ORI}YfXXNeFc_u$`gmKf8%B5*UL8jmHm5dd@m=H7_}mYOj*eqC6Y064AZRAbVlI&Ah%kPN6N+Rh zl^4Ia)fpt#Hgie03ozqn5%qc0+X~$0Mfbx?%uyJN{qZ2!wX_%QKsklo9(ZMDXWw#| zPXCLgySz@tGs`JxZ$%XQ!!!S@T{pGPcp;_1`=b0x>%*bT?!37o<=1uDuPgod-uYO5 zPkeo0vbA`k5-%LF;`$sb!u1lQX3#aPv*K@V zS3Y{}8MNAJOz8tIfD38!yuye1%;(j%eDiL5M$-LMe&ojkXUK`#f!hzX!w(6G;9k=1 z@#S_hj{X<<@+#rI;BcCGoi1)cV;_V;ER7FKY)J+02G!O%H-!w}?IPE6KI+gbS?qj@+ZUq2WscamAsfY|x3sSn z`eKvcs>;XnUR5yHhMeBsuDfoVwY!Z?If7r#}_xjxFIo6UqL-DErPDkJvK{4Loxd)7p*9 zSiWrQQNks(?};T&)2ZwlS7GSk=mEn|0N?&l-PF!qVM9s?t)Rk>G2Mz^^fhb`y?~M# zde0xe9@@q^RXv#W1yCNse`hxO+Ge&ziVO)o*k>!|X201E=aKr@1=*t1^Lw~ke+5|% zQ%q%AgsXMI_cBh$=YT+|O_R;M(?G~%KY)`hht#YOoB|$q{H+z28S%4H(_PWfu31+XHcqui* z8Cx?NO%zacc0pFzd1vyA;ox&z=A;AM%Bu*vCqF3i?guA{2Ob)SL!O4az>K6z{n`K@FyM|-g0Tx*ti(+GYYl1dK@A1v*pPL4 z-MaP{@bi=}H9K25!SpYeA~RtP<|Q9trY+sbQg92Us_4@bLMY4qDbk? zDolUID)f4xUhf%t(YEWl+)D??1gA^d&4zJg7fR!%8TQTtxA@-wy5~O=RqV;e$6nLn zAiV;6XM^SL6Lh4~v3#qY>WW+_t$bFM@Jp@pU>XvL(k|w!b0d7j9C{5gW#A9Zg7g}! 
zt~lusTh^#HKj_U|OFq>uSnMsK+!4xZ<}H1;pm~RX^GS_@Kru#k%g?{wUmdpHAnUZ= zn0nbfN50&`JDq@V;k`K=(Km(DPpIZubg(bSnl`jcB$Db%XZXVI?`60;U?NzK64M4~ z_gJV7G`A~t-;{eFW2lLv6Zx4Fe4@)TUt^d>M!ibhPjl{t3)&?c7A5CAM(tDmHN@Rx zYiycV>Scim$HATflh@P}MQ}jEclMM3&Xf`=JqeK7v!h?S5nlI##ab7)!zqq90{Kv} zZ@9%6XnJd<`i-nUe0rVP!6K>Gk(P>eldJ2>H`0^Zar0wlSZxU1a|lsh9U8{B{0 zU=wAMqFHlo?ucHITu{$}P~+#t$;Oz2m-X9V2$ez($V0EEwC?QoX3|j&V+Eo+-`?0^ zIQE6AmRl^4Jsol#9bkEV=T0~!_8VTj8#CWob$jwd*z4aO7aFg9<4-?4z{4m{?Xp~c z`_I3n^qhDXTd=iax8I=#GSFRjgwjoVcsoQ#qy3m?8~U?Tj%#jha>{$`?B(B&HQC7j zT4K{2G3?M8UBR#%3p337tM<|AJbVJULo#lcq}ug^p7pO<09l)Xf~Xv zrw&_kp>KdO9i24l=lRfvxL0fAX_3u>@6@iqR6IBJ#t^Nt$gG`}jj@?jKaFLXSyirJ zFUE<|8gPug2Q<)Av(`w|JW?A@Aj=h8BC>H*40^yb`({IT{8nEf3OIgxfkW24zz>J%y)3j#DH3uOW3XF}O@ zlogVi?LmSVgmB6)2vg#@o#uH;7@)|8Dnx$<%4#CK1!``LvR*>aj;Qs`SytNL%TMu< zlXM2Qf`0v5%BH|j`y&6NUb^RKtvknU<>>^rfoX~euJ=e=Fc5PF+!zyGUF2M!Yr@o) zWl8Ak6@tt)(Ym<+bSV>O(qIi6?5r|qwe~~Y#6>WV8p1r7g+OvnbO~JR5KV_!YVDM? z%(y2B3B;WD6#a=cozxBogj%CztNFaFEbJ(7-h8&dv1F(T;)FdM$VwjkU{2cD*5Yqt zpxoh*)E0UNl!i~=ez<-H?H|m-NSy96d*_Vzf`EDc0c-QIc&+qt(!=e;F4GeccB9`Y zoCG#nb!71549Qzb;;Wn{q|?p{i!j+M%ti|848ln5Kov1E&?G@}_aedaT+lvTVAu{# zE#kqYt#}LIpsjCH>fd4D9bv=%*ub$HAQXqx_)muK@@N_-B>toxqqYHpJ%ZtKqUsO_b0pO%T*Qk;!4;Y&us9LLoOH_J2CmD~v;BEc#>_=NPE4xHD` z(uk{Ti>-_4g+;2Ad?G%!ctTC~lJO$nLKDnp#9|5x<-@!Q?acLJ_D!~0ZbTgmKjTAek6vp|ahc6G^ee%J2gF?-Wq}=LnLE5u@53kHqCTeZt^I{*UPqA{BoGlM^Mb zl&|#%_L2@cyc06b^yIf+19D%0?YX-Acuu zJk6LwWE4Ub<&EwNzz=gkko&s$e8|s5`QRwd51#*W2-x(?zVotuEHBU8&#X;AaWwIU z5l{Z-HesUeRPX1cM9Z=CkNa$t?ie%&xE;kFMi?Qer!JNGnsc9bfG-KOh*z)|F68177>^n&NUg{eFQ(6+75Rje%m*h zHyOy^Vy`D%a0CaAe%rvHYz&~d;$kSZo9V@7))DHzJx$cVFCQoX*%RKD08rmB!+snZ zH}j>H5 z`1O;Q7D*wjyW#zU(m1w4pQ?g=n`2T<%z}apEH+6DPDU8~k z*hMllQ97l3gkZL>qSmKQfwo=M_Z4g)^FC(7SF7AS&$~w`C_ozxV<&$nlQ_h9(!|n2 z(Uua>9BnmS>xO)3C(UTTjN{(Zl*|#66k#U4z8x2u&s;<;ld8i>%%#~$m~xAAe9gRM zpcy0%?Rjqi73r0MHvzH@9pNHRo|OBxZ;6uc5g0QD9T}u7PU47PPoxJN(47CqT^h@Q6BjAKV8=_exnSYORW!BU8iMp zB4+7FP%iX;W++C?M8{M~S86R6$ADf%$LPb$hjcPGam2$u10V#~dv^A{rd_NnwCj=I 
zWABu^f}5~5Vv$v6a-J|C;8%EHx=iBci0iyT3{{MJ z6gn@!&TYpD?ekrzk>*jo zp_(hc33^KZfKu+1gf<$yVt+dpTCMv)9rUtt$e#~<$}NKVUxZD4ALNRHUrAPy{pMovGgcAYi2B=f}w4asra9;x!-r~sh%AbeIAEnUca4qSIv-WeS?kvX=`kdS z=q^_H8Vd#E|&HndztxDf(@B* z*0Q)KzLLw|HLRO%JVd8{x?F+V8!oBZ>D5$KUr6de5&jw7f$JzY8#!899 zjOqhf&C6P;03MK#x_s$W=A+v}4Lk;c@37_Kw1ZFS2_;?*I;$t)WsLQ+mk7r!UbN_@ zJj<87^t5{_Z@!&4b2H$B^2->XC|TtGD#mON7LMO?|IW;*W0=lnAMAF8b+q$1pTH;Y z7eWZ7w^D@cHNC*nSDZZw>2Dukwi+nok0J|M(($bL2*z2njam(WhT%;{yXop9C>I-B zz^N|@YN_n7XTWgmiC7t#QAM(t6VcHRKyT3LQaD4l|6^WDpKWw}z4E)!`BD%p6~)H6 z1}hVTa#ik12j)Kg>zIe;(%9D5Leadlm{>VnV%E zQJ>;E6tqf~D=7n@KtLGJtYLpz%S@L0dR$GH`Jg~Y%!Dp0h&Mz4^t?M?e@JLAct@;$pr^N=({s?<2-RZSGvGS(lfMkwC;0*MO|kMoaTxMMb)ug^ z7>qgUs_6px)R3*B_wd$UC|wTr^9E5*Z?0ZIU)+6BWVrT8l;chm8rfiXsm@#YW@rxh zf1Ms(^efnit%A#z_lR(rh`%dm4qmg)AlA#*Gsx{!B_q#`E5mw>G^dtytk{#zc5l{;gJF z*IW$Y6|ldqQ!{}w);dBSx@+HiL9ZE^fLhQCIvOr+Rx|FqAz&O2rs=5ETrzR9YFX-D z#J%xOYqV1*Y5ZCmrC<42L-P5&_Z+gWC1SrtFR%zhk1M`)kFMqw(^t9>tg8#R zp3yEi5=tC--n4C-sHwVgYfaYIdhS=n@Q+#7P&7N3^y#cioiARif$~OGQU>i??^+A* zOwYk5n)6pjz5O#o{|PKl&!65dTO=%mY<-&!ytt(UX}y@+sG7MruW;xg;_Zyx5@@j~ zD`;Vc-Jgx3f$dIok^}33>^u)CMSaB)(`sw(3H}fJzdXkTyK@G){V{7)2^yo|Rx%QV zZd=~D%E$_uyPu$?VScV$!xKh)?U=dPLT3&xeVJ=^VS}(vI*#QHzQw8CtAU&V(c%b% z|IMTX%vP^WX!LSPo*pHj9;JM=+u2>q@3zCiP?~MUa%;Q-n*KF%K&jl5>l_KS5oKjR z(y4D=W*gR!e+7h&Yf0-|ie704NISQQ(X8dn9=K^7%AQK)zIwBDlg-nfSTzHZtl_Mu zt^7PBP4Q8?17%7Umr1PbpmI?}enpNl?zj*kgg0JBWp{i~M}XUo;?@L2oL&8vH( zsVFhrNsZlFs6ne}!2m4Kh+#EALYMR<ja(CSl@CApc<>1^liHMN=L#`kR~n^;=1VU5dTLgQvC**JYF@C_w2w<;;f#1 z|6aq^P;cw5_tFR&`)hdjhS{~IDs$^+mI#RGHU=B|h~phR{5Yt$`J_$v+np0T~tO-^SJk^erWf3h-`-Ku34 z6@Qjr8*_!*$Z^yqwz%ch3j!5`YO=-c4=Uat`)e{ZcDOJ-OiU01NqzM1w9jXU zKFGaMDj#|}Ee6(qfj{Hm#Du zmNB9dO1g^E9omK&>nXV)^(E-kuECKK{cd)RO8`Hylgll z<1l2f4>@3_>5ohNY3Mnn=Q4@>WBBfH77^#6Q?Spa399}v9~~lvfXG5aj{@1s0Xcm& zm4%~+sxjB^cVm}6oF-|p8aBTsFJ0|sW7f#wWH@$EfFogUP^PMG`WcnWd)>RLOaz97 z)X@I&(9xXc4?$qD@r5g~EThJzlOGL0Zbp35SlHs`uMzfT1b^h=_R%b)zxZFwqyE_56+CBh-T&H7^`g;R}UQ9led 
zk6rsv1m_A)FZSKaWcIJ+>ak8+kEWYGfF7v1T=$s&a=*QUn8-R`f|9u{o$Q&}_=6+O zGm;bB$B2|)uX55y9a~!v(S>rNl|x+&deGZ&AS34{v{@=kpEx|fp~tu=IX*M6X@oM# zluOLu$7Enw05A?5JRUB=R}^v@-SOK$5K=r^7^KEu@%B`~1 z<=4*a)KLahBr7nhXU1jEduF%IYL%Uf9F6A)Mh4>zog6azgh(q%+$Pd%^^(W5?CnLxHFl6I< zpc5AV695$5oNo%5kovQj2?HK)OHa-UmL4ele6c`0N_6gKrYdV!Qi=0udXa5xul$PN zuG7E)-H$))OIy5p3^dUz&KzcXuN+V96C|56rJjJ^)9it*I`wwU+0k2Hlwd!5S_d=Q z^@;hhgwI>>*SXZ#)a`?2un*@`VR}|U$Y33wjepQ1gRDsw$|jzzfh;w;^k*vXmQh6Qx68)>`k1qWg4~T% zh=<^mPvj*Zbd=P^ldT^scXEjT@7~GT&=kc3*7JyXBSjVz%0g4R(n%KCdhC)Ek zf<_yP$~q!7EMPOOR$7CNkUNBA)dns~fi|%NscWXGQ{un`MkZDuH6?+r~5sR1n zBG~Djuy`^@vbXLT@PP15T4o|uWY_s9)LyWE+|d5V?jdYx75WG`vm`%5Y=n>HpSv(l z75v;lvvz{N4w-e7wfU!j0Ri=%hxrf#%;5 z^`0<+g^tD$IWSGD_ps$d9A1OI$H?CFc7+)M%mEd~1bI28GXq!!F0El-!qqll_EnVa zVmWKPZR6*^cQ!HHAkY^RL%1z-xO|$Xh*Fs<3<`aT1srMs;sokH`CAA70sSW_r5Dd; z0nT`LRIcV!+=m{1IXy7Ve8RjAV0Ywu9#gxD7`T?yvar$6*=a`PIzp$`im)=)V^F0! zC*ry^5>3uqsPu|I+ym1nZt&JosD-VCFayJU)|S9TfP1UyBXB7?}^#u_)*R+H=Jd^HUt^wp`X17ThBVdAF3Ye z;l){?#F59;2p;nnc}?6Ogu>;ZAB*BL9qRRN zA-(aD0h|S+IXKPfp)1aF3q&e3-lPX!Iv6ENb?Mof=HN#_+6a> zy;LvQqriwnvdVX?hULiSS4C!Ju@zPEf-A=*jJ#>Bta}|cLYac87;L4mAwC|}fFD84 z`N^d4C0GjF>La3I#*xET<+Ei}$7~VvU%DQof=8GM2JO`7&h#N@!F#bG6~o# zsF$bs;O+yYUC=@z0qy#2g4lw_gYeW>;GnqqO>PSOIEjT^h8BNgwVeuQ$vXPhS> zmhSs>GF*mbqGf@DoMzR*E$h7UBLK-_g$EJADrfwg&C>{Bh8Q|Knf`&9Vm0zjEYo+c zjvP)-2HJ&Ii>@s?M*R8<^b%q;<19lq#gRwbEaPOsC;}}=9wee%S#96eI+bs#uTK7Y zy-4lT^?@u$F!NMWm-Km`eK+ym_lO1?CQC%sp$LWBDB#V;r+^(d(By8eLZ*!7>E4&8 zI9YbwB`kE)pUb=!z>w%bqysQx15NGpitdw`wo(dY9p;$8b;4nJ+Jg6Vh_R0)C#DxxXx z?ap*JXpx+s(Z|{LdbH}~Y>6)UUauzH-It}vIE-(kSJk*qtbLRxdA=^X#56=U84Z8D zX#z`uiFT>RDp?h{#@^QYi7MH;vOF!O#g4L5XZMVh3-1ZHS8(bJStr{%s{ChUHr00= z^d4Qf7YNo{nD)A^5nQFP{TgOrj@HCB(qdP-3|!K`>{rUR9?+3Wx*fFQG~ddM`r1%= zmE&*O4VY^uTOn1No!79{u8pE%y+(e3dx9r7jZ_{))kva3%5{hDeWpsTN+@*&k)K3U zLucE;paw2$rys60?=^r4o2@~%b6o?gO*iZtIJ^e1Uiw$Zx9otb$(w0z(5WjF>3vj~ zi=iV%iFu**W?&4_p7NI?w*ra#@+Lu{sXwU^Qy(>F#*!AL{$90;xZgq>^aZSg!JS%G 
zX^wy;fH-R}sCIa9-n+rJh%$|nC?AmMmd)xxB;3WjfPz~5qCO0H!w04;IxeFDEEkI) z4lc>(FlE9%5M_VDEI6Op44k+O5@rSf4vc-?Xf`DG=p!u}`Bg2ZzwB*fWq1^$ZG!5F zqc1s<&W|A+&}Wl0LaUTJI#%{-t@@K@o_u*{8MCkTb5{w0T1}a_zgJ;gOpUWrIiWER|+mVh8G( z?q7@cP=FWU=6MSKyqNn~OfQ*n`MnNaByMtih5xIg1Ds{00OX#s*tud8Pc@q^*4X0u z`8eCs_unEW&}p^4vFGYWK0;eGBde@Tz-Q_#kb2GEyZFZ0&(M21}w~9+$+j zfG8bhSSX6 z9r=LcIfWzX++b`pd_)7b_I8>zCkR{)rFLl`^W^RXL){Uj_5|jtsW<7MIBf)aM@$~@ z-^9Az-y%@3>t;TAG2C4GVM}+9W z`o_R!D_p1LLlo=X&6E$J&$V=^@Y+2ttl%b$`;F+;2Uw=%BA`X&>euBh`gkx{;zz&u z%O$fK!Uib237$!c&Y21Bc}X0?48*pOEl7ucFfjMmiJ+2Re;#O2wQD%(u9ax^Qs$$~U+ZGpyH61sj7!7 z3jyA(>In8rOY0ixM4buW7zd0SXe`{wWtb~$G$jn0x_@SkJ|~`3Y(EBVXzW>IM|H5}A410)w%?1>cwvzZlch;YN5*|2qqb$C65`f>JT-6Vj;SQiQ%`Rb zm4S2;Dj73-M2iKy#C>sm)5#_8kE;sQ$@NsYYy2)esemwEW^sa<%KbfBEvvJSav3bjgbCKY1c|?xjuBl1o~BYNhBRk0rIvbtOaa74 z@;^#9zsh-hFE8I)Bmo^pH#6Z7-UqjX{=YgMS9&f5LW27pASg#N0G!}&oY01^kh4K@OjFh0Y=6F0(AwT01 z>h)DvEJwHZB5s<5sEg({DvRcIQV7GD)yVEte{>Wdwvb_3wmvyRg+`;8s2pDk{tbRa zcz^42(~^8$c*y)xIkJRE))fa-RM2>MFd8ics3uy@BKoaK`mK5VU6Vq+*Vxq->wZwNF27k*|_fTa)1!s)WW;a~eZkvh}J<|v;k z-CDojQxcgQSP|7^u>0AAmDl@3T(g#PZb7gj0DkSF_tC|cBpQ6-O`wWH^k3ADnB=se z6QLbQb4Y2_O?O;@!V=*zzRCKZvIGXEHec*+6$qLrywBKM6~fN;c;vmDeil(I5FAxs zW`wgUO(IfbDM3au8U8^ITTd>|I|%Sgw*JdG0?21AId+3vNqS>0;v@?n1v^sL4c=XX z9v-he0!)N=fUYeEKFp7#K&&-bp-12kSI7MS892!G#*hMvy;N-@=QP&3 zHqfQNk|oN(YViOjq9_!QUn^E9)DAmqnu)kDP+*#|2s{agb)c)dB?PMsyyz#)rC5iD zRKu_zwA#$Nf!$QMa4V8pzJyp`a#&Igs4=R~aaX`{ABU58tVyY33u%t2^$Y9LwX*~{ zDy#oTy%W{4ilqt(K-)j09h}Jw!rF)W$BMV`TyFF0h}Vc+>sGtRdPsfi9>30E#Ssr` zx~}shG*Uo_X2Fg$rjY9a5HLvv#kc21qKo1M>EGVc zSMNta+Pjzv{X1N$AZ5u?<2^up*q?K+{V)>G4#Us4G%S4SQ9*_$cd08W)STRS7!mL! zI(H&sdZF#bVs*%f`r-U|zpeDqufTjL1PLkC5d|^#Oft=c=*NHSsRN~PNa+V1!-ZAx z_F?1R+LszATCUokUg+90a%e4up>@ftkFe$IIGnc%b-sq}ei8hxh`7W}Q|X@p!|?p| z8#=?$9TBBM;#)gLF&9`OGPhQB#7;9E6ZE5^d(GU4)*8o}3lQq#0yN&xdfiLy!uTBFho%^W#Pbz{4K#K#1E!w9ppdsTuHfFU{`W*;3 z6a_?lnlnLNla+eE?z)rd)HytKe!ysj(?yj=%AfxGWS=B`c#id?4sqn|ahe&><2I;U zbK#snF1=Fqi{IqWOCst=;IjQ6PcV(zH=!-;@-imeBo+JeY^l5~6(4yi)7Y9TMF^Y? 
zfjiYgtlWAdEIDqbU?HEvu^)0{rM|th8ifSZ!R|T%FAar@bO3|BzSCs%i4R0Og(Kq^ zrAxO=tGv&QsJLWO5~;xR{yr$vls-PcOM#3m3;+J98uOTBsc^K?t?S9 zGq^+W;2PZBgS$Jy-Q9w_ySoPW;1b;4WcTf>t=g^n|JT+1_3bmK`*xkKug=#~x9^Fm zzk(k(G@Q4BBXFgxc)z*0U&cj+0u{2hRA)!I%5HckY9PO-u-ggs=4-u~S5pj_%V(u& zM>@)7NVgPV=HvR#3kdZ7q?w}Damg>tn?)bw#3C#~;VI1=&fQi{VK&!~7|v@PQ7}ar zlzqfrop?w~yH@+2AAnX!S8kQJ!3fOYxkLYrcC#J&jxgW@Pbh6_dQ_5$FoHkZHdssJ zvK8Ktm4q^&E!ENzyab{2uyq`xEdU|_wlfo8UrVuEH3bnKGM-X*i8m~ge%)B+Ub(lAPHToFF7dB&`WHIRXY`_9 zyB}a5lvlHJKnymyH;ibq*pLJ;D=Ht1<3cLA$2wfsh}|zq#?hPEH*-EJuI}p4q7ypo z6EC>s-~F^9VjbutV-^hAu02D1X6|o~0u}V?^OPnNpk=B>P=B1rP+mzAVK_d!{o|}& z{MtrLh%*?g1SBi6w&ed&n+z(dkq@E7v@N_fl^8-K#MY(ZIJp-T^v|#F*{*+k*82I! zy@Sp3dzDG)!4>xk!>~e@tH%~p&trcVM@_Yg8A+BeeVW_)@1eVhzA3&@<9d5WL>xwA zcc^{MK*2P%mM#Me{r5esTlu$l&5v2#giui2tcE}` zX)h5c8=c?R-siQ##ft*YVJg~BifCf^c=?hb4>DAHT}ppuN8lD7lQh7SAcv1i6G(BR zJsq_AWzKL>di_bU+nn&RKRORSK*zK^Ce}UOY|G5RQ<2e@x!6~1 zjLU#4ioQ(?_`-UkZZjvb@LN88olIUKph&NfVf~w-|3}Nju`eqRCv-9}KV21ximE=? z{rRYHT~I^gSAv@{%gSAMqx9o7q{19k^C_?4<*i_XyoDf0n4W7=4uI!@Q?JMIl@h{Q zOTYnMLxpkYQ{7laA|pmI^BN@F>f>u;i^fE4T!W4et1j}sD&|UV4Z^0Reo4c3 zoU6O@p?SzYRm_vxK~2v7P+LE*BUHZ=MT~pgSr#lX{qb^MB`;L$9 zGD-b15Wj$qO8tl{Ep_XkwxX5ubH*8Ya+qq6hk`CfwA4@m#|>AF^c_eHnW5#qqLFX(uo@#j4vAg z+!GX~%fiy=4~Rt9IWUt#(N@DvB-n;{4&1}X%-&>b=b(nWJeWWMo+K*imH`7EyFioY z?PQUNQWi=NctU2Yg$OO1(q`z+gBwi)-Z#d0HjhcUdB+1*ROo|CM$!bR1QT8|`E|*# z^!f*1mxcvfKebm(Utvox4>hrlcdNOc853St5$bQg@Gk1m?3pyG^;DW}5CvC=V_X7A zqB`=MM^3q-4ZOj%?r+F>ZV#*qoD-n%aNh#|r#`XAVYRM!m!$3r4AlvUn`Y8=13EAG zz>m~%_x^hhFCZKUGj47ow-qBzFNdi}g}&d==bbrNKCv20JUQ;I63M_4xzUD#?Xln% z-RmFgi8+!rH_m;z+VL-Zll- zTF3j^F}-MnS0`VWY;sJ|W4tRCJ)Q0}c7yU%#io-@WX@X|qMJ5kT#Yh20AWlG6@uNi zMZYu$5}B063hb*K=bnZ;Jw zltJa}F?6w<9&taTgC7sA@pAM=Ya&ykfbZW7wbb$qx$89PGWpO8d+lB^eh9 zoP1x0ZeVpJpg&mi_=kS*zGoRB4eGQDR(&#$Oersa{?jcIX#VAAye?_keR0ADr0)$HK;U6Wqc9Lv)08(h_swv>8uyjeZ9YUJDAUs z=?>@c9)ZJVDcCVglaZ*p=at z%~34U&XKk+EOe!Gh~}S~zWyjg!VhOovK?Ni;n@V5E9A08KyCuY`_FQazerSkAo(qE 
z#$He;nKScXpvgd#bDfglh@{$M+0b}nj5=^htln&z}^ovRs;p02~^qVL^JBB^)|Vg_(Nfa+()tmUS_d*Cg&`~bc?_t z2Rd&rej$Jm<+WfP2TgKo*Zf}3E$^30Lve(` z*Z%HYH!gqgHo(slCTY%GdociScecFfP{AU9gruK#vXV*)xwn3G`APJx{$NIg(gN?Z zNxjhu$0g8FKGeJ1`yA(w+vOt`75OQ{@x<51IN(XGw8IAi*{++|Y*+39bkme&M}jz$ zsw$#&4BEu8(_>{O0mzw0Yx_LD!e-50Wd`=M8vxpEBDeEn)mr3J>_9#K)h7daX_AA=;0js7DdL4%EWW)u z*KW|0K))Xdf}oez?!vbhB{NfHJe`rV);JcLriX)}&!74y|! zX-m2M2&>Eu$YPyrQ!lKDG2k4FC|oxcDg(0rc)L1Ni05{%+=P5@J9Lb}C&W6As1EyZljOVRWGokm7ZkKDnziCyy^ z&WGR)5LLXDZ--+ zGoXOLSOhT!uIEyO{l;S1o)PloT&M!F&wKsdhBr}#^kt(&ViS==QLokm&ZF9D$ZUFV z+zJ#GHNQlTQKi34{^c0u0?@$J>JPm|`IA!}@8Q)eqcv*#Rjcjva|+|r-VQeM>edg) z(q%7DU=rDPLV}A}d+w^6=WM33xX_lS#~>?Wdu0OqMF2JAi3_3x1b(xSg9^Qf_C){k zAjF9d3w3XGz_0;o$ZVNm7O9uRYFx$i1Gs-ZA?kv*v_tON_IqZ@2j}wBSd2G3%~!I{AwNJ1 zz6tdb&xj%$2!mLRc(lcYR^3BZsM=^7$rX33nis{P!QgIUW}fP;&v%y-{C&Qs$#K_94<*w-2FrxC^e}V-n=4%I%&9| zG5SDk^uAx|xTA1x|8`q?R-K#mWoFdVPu^Vt__Q zIaw1yP>)%7R-fL7x-T?#r#kU^)uyTjCsIEAr%y-c_f-z)fn(X&f-DDs29fJ&RiWPB zT|(LS6Y4UFSHQ2G{#06mNm%nu5Zpv1^iw6F;wbvm!Q7|7vmNBJy27ba7DMcO;82*z zvL9JcsLuUx=H8GiItHTZko-|gTTXz24(H=_%sa}$hQNO-BFyn0MT7w$b}qm_7X(Ie zb1Nrf2LPkEmA;d)sIj4~kud^4KZ2u^gR#Cfg6pcaimdGp6Kcnynk!1;>c;`OVYQ&x zjKD7hE?S8e4QgLxu;D*r2(G?8-=?S5x2d4`fd{c%rDG3uKHk-5EjMnqN;*|9)bZRe z<=V$zKObPMttops^0l5{_q8lG?4CXf@qO~DbHR4jswz`y>RNNVF0({lT8yfwmJIS( z|4xiik#D2BRvl(Z8M1p91OG}4gLj(zy1%8Ni=GfuCa)l7F0gL5alanGa<6GgtnOmH zLOe6Vl@gA}PI!BE_l+l%ydkK&JFsk>6B~EN*6~jg!0pwgvw64d*KLnyRne69wI}Yz zm;G-{UIdz)r)_G^&ywlcTDtPZ0RhN`0RzGfHiy#gDBf1lapU*lvpZN$A?nmg^#99d>qb16j#;=QqCFd?0x65grGqvIvjEh2s|OE7?O`BQ?Ozi+q5IgITPsM^_3j>qmjZ z9=@2Nm~R%9(7Feqc#Hs9wavUr{5)=IxH)k#0l;#zr>^+4Z2%KF=&VlJjkb+Id4y|E zmo&pXM7tGO11-nJ2|Wptd&zLL)El!0RU+yngD8d&x57KbWzp>!c7oY40+n{m+=Wgp zq2!U4Vsd)yUdej`>>rtnIKO#CmQG$1zYUkZ$M|9Qw%K9L>gW-|tckLU3zVM1xF*Cc z4Qid#=p$9v#WL3$qZ!UmnM5Xc?=_Gx)&?>kn*t4YjMfC$jCPGi=O7Y!fL2$)0mwODg*{wxlUqwrZfc3ax-p+rsM100v$Ab{5KpIvp;9wQuUi= z0@)5LSO<6}HB3cYDp>Qh7tN3Q46^CnYj#-O>vtTH==jkkJO?#ZJqJ7McUTu2c35A) 
zd;5OVKQ;rVM@>7dPMTM^9a>koof=oE#}QN0wqP?|ziEY5=+qYW4(mhE0}@}*gG>wf zdW`+1I}ua;Qx8S}I#V=vV872FaK+x`ke~;7_D8K6<@m06|Krg2E#-DqBx0KAAv7Wa zSr&moxs@~AHhe(}~M19&W4=2TiQ$+ryYTdAEi6Dm!^a~{q`Ab#Y z)w43#>6?d%hl##rH36b*(?&v_b-LDt?Z4As3!TwVJQTM#THoURo#$^Op7|UntbA~y?rilwT}yx8 z@R)778HHU~%E>Fxwc&LZ@hKj!TwH}+^wL6xt+z)wAQhXSvTv)hm6*xb$z@>5wP-t_ zBXj43y@36Zy(w?{efduti8{l=%ZPl|#lm>_iwj7VJcLbA_)V|rh)J5&^R-oKekZ@)4`lj* z)oo#IciP{c)k^X`G3h@2ko^LwM2bolYX`XYn}#6EX&B;WS0@uyDy@3dS@RA!5S40boS^U+i$Mutp4*oT% zG)}&E5(}l=;H!uTrOtHgUXVt2a)ln!gvJOCkuj!Wnfz-yjt6Nuu88U6lVL@A-5btJ z#_b%@sD=*`AJ>UIb@DCM5ufQR@Jho+*mE&e*m06u$#XZuSv+Bkn{3oLGuWH%ZY2+! zmOo#DmVc=GAEc}>9xy$XrsX&MM$6wPPy3g<;uor(!QQm|BE^{5Gi{jJ+v%9urJWB* zuVC|}u;*Yan4sPy`~kY5Q`1VJQwiYf-3xnCc0S0wM9RbTI$-4{3%^J(20g@hsfWHR z$G>Arvp)&|OQIi8=2Q^U&cX*cp;yv~pS^^Sb~?ce#$l zRNhN%F^I)(gtz9YtGCai2qd37#&NY=ha#Ej+s2i8jW01$@?CTsJ0XTRj}>E_K4OIh_tW z9qLXG&%rr$@0&A!tBa>omCNn^#U&u?xoWjlVa+7nbB<#-lgf7W*sf|l=+@_l^M$NO zVNOi5mvfitZjMQaVRZqqIjNA3^@N_(JyuYkx$Fz~dckGY&1@BH}m0Ockpiip>r{X-RKq6vV-o95kXTo*Gk@3CG$Zh!CW=m{XhlX za;PVjUYsnm#nKwAiAA%cnWORs(hPfoliAZGaoKN*)zA(ic1|U^tm9nRw|)M*(~W`N zgxHrBsjT9!#y##Bb|auapr@gmP>@0(H}hkK!Gq!60%v}eH=ft++9mE6`vht#gP5GY z{=mmbTz;Kt>t!-{Ldp(zqtn}d-0L7Owsu(Hq8`HFCF%C-m$-DgW^6W!3X*|0iUJ{p z&A|H&VJ%^A;B4yNxZef*g6@wP;Du~>M@@x*s4UMq;ZaS+W-pJWn=RbUn}Utpq2j@1 zm67(~TvQIeFHeNp!HrPKq}RnJf_wN8WcPxt=KZ5)W-B;n<(r;gw;J3j!}vgwYkI2u zCl}b*|BVY@9c+!94Z(7f+Rn&C*;t*PnSqUofrTB2z$oPGWM=C?4P;a{cd{}D@0FYl zEQ}4Ez%3E5)N}&>rp?O90p2*ZnV6Vl{(}BrQ2qRzhM3s{^9j6 zv)%u**!*o7{RP{tN(a4ZfX!^l5HuBMMf%0BTWg7A6)pCN3rpW)^lX zHYR!|PAVoQDzKldt2#meBfEj>ZU#UzNqw8N{8ftPJ#RZ2rbV+05J# z0ABwY6FPvZv4bPHYXB<)GaEBI7YheFD?JMX$G?IFKEYTWAY<-g3;^F1E2F5ffw{g7 z5MXJi?_}m^tPgPczvpu>urh!{3s&pKHb%Av76>f=_9&Fy?2G}7@&*=4PSyyFasXD& zzgpkH(Fwr90dC3|+n73;0hrmr&+i}0|NJJ6dg4ub&lN6tu)nP9KTOlk6$pm zTOLrT$TKs8x=pu;O6=gy6kD`7lmSgmW= z*2j2W482?YsxY$+yHEh6l%1g4=6c0Dz)7&7>eC80Lq`bLmzQ<6MC~)MDcYCPL-7uN$K(cTyit<65e~{^d zGw&ErP+|06<;{0r_a!2ch`e?WB$u9{RE8}pWNQjO4m;4>F0Qk%(_uUPEXmJn@RHYT 
zsZML14{|v+m`+&OOO`qRd^idI8vN~*lcXOX?}=^P%*?EL#Y;2Loq8uDI6U~O=%;OO zD8*J*uzP8K&h{Oy*wa9xj{8O3^oceeLBW6kvf%+%J}O*EPVBL^iIZlJZ$2UCM+^Or zL~IJPUn=NVEhA|y(!>@JMd36yq63$gBaYaLJCfhZ*>T@`;3QZZa@5T*{Hzq?j@eG_ zgu2O0F-iRWE&}J|OnzuchW4M?HSNtdJk!%(H_kSYMdM_v@^8~YtI{uMwQ_&?Gdkmv zgj1=GB2<8D$kN7aJ69`eMJ`?0J5$talZQ@PsBx&EBQ?Bg;{cf04PCwBHfnwk8j=&t zVS382>YcC(fRYs?U1xSR3A~XO^qmRKL;RX}Ahhc=6#A3{U<0AsbD=77FUZXu`BUK+ z=21WF(_tEj)?@5aXQE1V4(dw2Sq9JsX-MGYQ94Oan9C$+KV#MO7I_&;MuZWV1ciH$ z{GLpnb+?p9{H7iv#+Q;fSCm$Wc=fq(YtTxB}vV5&~oGQdEIDE0iH{3|vk`Bc^=mb|ChK8RiXN0Y^bR?T_2x*JWF-CU(d>3(anhi5kN zrHHFX|K_!!cdx*uk~fxD*N&R2b9gYyLk^xAdSp|R7r*G&sz>LM>KU9%v=7SHU52rg zo{Dk^{bjwNeWhc#!2#!Z6V}~a;FoN-1&qFO`!yZ9&!kne2?jdJ-$OkdS!oeA zUP>M^d=3R3)>SFT`()Zvk;?SBa7p<0Z=E3#OCnkl&T;pT_P21W0-R}yZayyj==Ia7 zxeZI@<(3ntbOZ{5qW3SKIVvRnVfXgq)QjO=HB&s*#<)(t9G7+D6y_J zH!UL3b8!DIme08C zp{4Zgon6ItS*mL#wjaVjea+7ggp&jAMFAl2?fuzPn?cBW9Td%0^`pNwMO-9u>R{qI5 zO7tmz7WDV(7Wrr(3Xuxj_Re^Zc5lnXGnIWJ7b7BPzQUN1y`<&;_dax582D0RHBpo>b z3OM3NgHq;-4)^Z$@Z=;wDA|<*@Ubys!c)cb5cK~%nv&yBQ&f^0jeB55DuHNSF)b8M z!4f$$0LS+yGh?^fPkzZYq*Gw;P|W*mS4S`E6M|LE%`?C!=IP^@(5L_MuHopU@8IO> U01hWRh?#>8fr3IzUL4{70B%nLZ~y=R literal 0 HcmV?d00001 diff --git a/examples/notebooks/intro/input/solar-system/mars.md b/examples/notebooks/intro/input/solar-system/mars.md new file mode 100644 index 000000000..f28fc1a30 --- /dev/null +++ b/examples/notebooks/intro/input/solar-system/mars.md @@ -0,0 +1,17 @@ +# Mars + +## Solar System + +Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun. + +For more details about the Solar system see Chapter 1. + +## Mars + +Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface. 
+ +Basic facts about Mars: + +- Distance from the Sun: Average of 228 million kilometers (142 million miles) +- Rotation Period: 24.6 hours (one Martian day - called a "sol") +- Moons: Two small moons, Phobos and Deimos. \ No newline at end of file diff --git a/examples/notebooks/intro/input/solar-system/mars.pdf b/examples/notebooks/intro/input/solar-system/mars.pdf new file mode 100644 index 0000000000000000000000000000000000000000..a48c4365b376f23dd642190daf000c5e35027dc8 GIT binary patch literal 57872 zcmagDL%1jmtYo`w+qP}nwr$%s&bDpawr$(C)&G0lchrNdR>eVh3~kZn7EhE}4B-Ea|DiP{J zdo>%>sWXa}17H)czney)VbBV*p`}^S)Yy(p21v)|)umKQ$U)AWT`=ffQ+^3zwO^v{i2l?mt@J09z7a> zAtSSB^X4DJH)r~JTLTI1?Qzg`3c1EG%r4yBt%5*kZgP5pr)5z_E{WFT_oLMAI}L|v zo(-Kruxcv81LvCOQ{bmU_T-BQG`dw2=Il!pd{v^BOu31jHA&f<*@pWR{bOt?(LSYy zSNjdLDU#92vN`!m;b3(0aisE#&yH}e+mhozj# zi&yG?F1@qmI3RP~X>P)R6<|1p;$88`_T(E?BW8f!*6tgC;NxFjM zkljXsal^2ljJN)68O{+N10Ug49K|$2pMVakoud*}pIVRvzyheU(PwH7eqjmXqvZWN zc1?mBwXeSeGvrTuoyL;V?OGx6gS(h8Wi=)Rw4+9BEH_`)k;}-&o>7y|psKEQmEMfL zeXU21+Zrw*kG$?xoIS`5hX2|J{)KrI^hPY+8dt;;Ym^4bYuEJt*Z???rkpL5(MO$A zh6m4&2@M>m(+>29Ns<6H+7`dhRI?C0h2g)x_YKP8hgS$?YG?9)sQgd;e?&7g{aq#Fi!oAg1NzWv0Q*$mdnqTbLfzlGsBWhcs|z0UgKU z@&nrh#<;lPG?)M=6tK?!DO-Rr*awRThh!Q9V^ zW$g#BsI<3)sb>VE09*l>j2eWB0#F3iH02bP+MFAXl51^cd;MF9&M!=Cq+k_@&=-@$ z08l9ai=d*W`tel-((-?`7lMsf{A~WKdAfLK*U{35)R)wiOAoxg;{?C}m=*8_Z|XDs zD`J0~n+5n=+B7e3=!sq72L;%Z!Ns-Zyg1j<(Xo)t{%;c7Ygr3h+P@3=iREd)eIsLg zz;~WaVA{YxfFl#zXUMW5%6omt-+v$kre+Y$ZUDcqNDUup3zx`~bNTW;M}OAW$PittI~2CVu}?KJHdQZTRM7rWS{!_fGi9OYLut_kZozzwegVnj730oE===)x`+< z*Qb%*+q}d3n3suWgUf<Y znT_|)09S_u_b?$Gn%#lCJiogR{H5z*YygOvg~#L*{pnEF{TrjTv9|y?eoKC@kNaf* zSby#RXp|=BP_0aiZ7;wWfUt%ZC)PRq5)Zci{8!igmysMBo0AvXfc(Dp0)ETVdQ&n= zw*Fqf;L%n0_tg4+@CVG`h|FpQh&h?Xg_rybKIwbfi+;&3K5ArX0`rhR#7n;2aW(yr zUY(a1zq-r;Eil$KeurKsM1`TP^Mio{QeN2Z8Pl-isR%=U?y>0nROIY;pL70b1q$=>ZsH zd$0Een-~MnOa7_;5H~Obo}>Kb4L?{e{rMvS^8yFJz1tIU2h2Y1iNFEmto%j5g$lW;@-^auv+v_Og>Ql`*^=|-~6ir*!Zii^N&VDsCRtHH#9Q_oo8rt2EfqZ 
zBQ!QJFnapbZwUL_tp3#`$V>XR|G1L_1OU+@kAhqfsm=4k0j`mT;%9tUEi*<;s2q9(VGJ^pZe z=_%=PL=62HYgDyJ`C%?j^O@p`q>LkJd|HmqFlW*_(t-q-G#(4WxzUwZj6ueySeqEu zGOe-hK%fhChM75kO)amL9@pn~NGs(Mz~RrGHzs=V{z$7ZdwC3E93vsH_3n&cd9ar_ zlz>R1rdR&#q4W22liUAHfc8{3#CUm5p4VkLxG+*#ok*w@l9ab0r8K3dZ>i$^aNw&hp#@B`?py7MJGg-hRpA`g8|gFR8f&~fb5P|MM;;9RH_ z9e}0aQr6u3B(sVB+Vm-YE?|XbJ_oDx{*dz$4eO0`LG*ztmm_%VhFNwuLhs(G*+*+) z;*SB0{=t0t$6ZTQZAWJoY52iHpG#)0qjMt26X4uQy*eVqStu;bfF>aO=|zref$mvq zuW~WPBo~ixAjCJywK~$xc*rDP@k}0b&60Y7)8l$?S)fQ({*`_dw3KdehaM}NJSbma zctRE!+arZ+v515{Zm-ux_`o-i_E0fVj9=OH3$$_GpmV)@L0g+k4N9&P&t8}v_Tr(D zY7j>Z6FL+7lbgjI-}{OV1wk|U>;ypA%WhDvb@gCc;KY^ra!fAH))W+#!_KM*0jJi_sXg?>l zxG8Nzp{ds?U-&IMuO9k@Tz<4}Tm-5__*8-O70Zw`QQN`g+;eCK(B0sd`#y^Bp(3T#;D{a3PU2 ze?;KPB_iJNcYJ-x_Z&V0wtq(H7ivw#yWq5b?r2UH* zi6v${izIteMT}S#q~b+aR>_LUOx6MLQk!9lhYz>znT!39=t{Hbp6`T0N-d}>rdpa9 z#@ppr+l@{|%6Rt^I+XObV~*Ol_ro;@y0)UU`As`B zp})A?>G`Dh0cw!5(p>FwbAe$k_vAGtng8e9_ZT*4t~Oe1DWb^@X$R%hrRy2g|<4ay1W&R#R@)1ef@Z>v{l;_H_Cbg)%$)NP%j zDwwkUOY}JIOjf1Sp&~d*JnjSlgaZS3Lq?H9Z;~P=Rq6*wRDdbjp4Tj>vtr{Z2UB&T zsbT->Fz27I5z0)!Xcf*Wbc@Ez905*Zh~S^uKjeXhl3sSckxtKBaul@9JJQ?!v0)ev z_G@GBx}NNET6k`?84NHJuorLJ&qJSUJcwxTS4V-Y$wb(PjkqX3Y;qF5olBEPi#?$9 zhx;0A=%q%9os1fOw<>0^gR=S472-4{;%T2PQdU%6>y(*VKsYa%3zMA%o%`_8Vu^?Z zaePS4M19!6QKS>I+4kEEJW1SSR!m3-h~G8e4a^-{CaNDr<6-HvqSP=Q@P$h{a-M>5 zevor;XCf@UefFp`5sd^(ktt*CbEpVVWdCJ+FYQ;+oR)>}OBYAk`g-5p(wd#fBd6AB z>l<&iRV&>Nye|9MAs<q?Kd=_Ki#zRn8-4DB28-!T@f|OHGzjcNa@vR5k0p)J1Q9xd9?6s z)KSEhEhbp<-<<925qj)7&S96%tz;ODdTp^DPt%AlY&L?ECZ8Oc26hSchI#1@Z10Y6 zsU2ATuzC8z)0XQ|FzRatCF*=sAU#a48x( za{Aox*JU!kwk8tX3rOkrnz4$5iUyezs5OLeC3KC`O}g-;tEY;xA&EnsK75X*%(~(d zMn601K{#iJ^<6M{t~Z*B<<&Z-kcj7RBE5}&!7v-?X^~jo zcmKS(a+eybWQlXY#o$u?S}aX6Su=W*L}hr+YK#59w)m*DGORpAgzzOGg;g~c5USNy&~V`%I60%1ud;LW>~}4e-{e0 zsmCA&_1%bXYws|MLa)JK&Bcz^3-w2`B0hHOE{cCd2MOTX#heBYDhKxR5Y&&n{w2u32*&Gj-pT18u_4JB=g#&|>aB zTIOKS(icq?P0QbvkCg|QZy9dbcTQlY5ceo1z54`EH7)Xj9T5{M)Mz$rVJTaO79!EG?MNM4uPE&CVx>@XsDdj3fp9 
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "1 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", - "2 earth.pdf Solar System\\nOur solar system is a vast and f...\n", - "3 earth.pdf Solar System\\nFor more details about our Solar...\n", - "4 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "5 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "id": "6bdd3515", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "6bdd3515", - "outputId": "00705442-b6ae-4238-b0f5-c94de690ecb4" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 1------\n", - "Basic facts about Mars:\n", - "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 3------\n", - "Earth\n", - "Basic facts about Earth:\n", - "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "· Rotation Period: 24 hours (one day)\n", - "· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } - ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "2b34d9c6", - "metadata": { - "id": "2b34d9c6" - }, - "source": [ - "### 7.4- Understanding the output\n", - "\n", - "So we started with 7 rows and ended up with 6. Fuzzy dedupe removed the following **very similar** chunk.\n", - "\n", - "These are pretty similar chunks except for the words 'the' and 'our'\n", - "\n", - "**earth.pdf**\n", - "\n", - "`For more details about *our* Solar system see Chapter 1.`\n", - "\n", - "**mars.pdf**\n", - "\n", - "`For more details about *the* Solar system see Chapter 1.`\n", - "\n", - "Pretty neat, eh? 👏\n", - "\n", - "### Configuring Fuzzy de-dupe\n", - "\n", - "You can tweak fuzzy dedupe by tweaking the following parameters\n", - "\n", - "```python\n", - "# fuzzy parameters\n", - " \"fdedup_num_permutations\": 64,\n", - " \"fdedup_threshold\": 0.7, # (default 0.8)\n", - " \"fdedup_shingles_size\": 5,\n", - " \"fdedup_delimiters\": \" \"\n", - "```\n", - "\n", - "In our case, we set `fdedup_threshold` parameter to 0.7. 
\n" - ] - }, - { - "cell_type": "markdown", - "id": "5370950a-2a3a-4143-8218-f9b4808099ba", - "metadata": { - "id": "5370950a-2a3a-4143-8218-f9b4808099ba" - }, - "source": [ - "## Step-8: Text encoding\n", - "\n", - "Encode text for the vector storage." - ] - }, - { - "cell_type": "markdown", - "id": "85aba685", - "metadata": { - "id": "85aba685" - }, - "source": [ - "### 8.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "outputId": "e1795167-9fac-4b7c-9417-f655c30848a1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-6: Processing input='output/05_fuzzy_dedupe_out' --> output='output/06_embeddings_out'\n" - ] - } - ], - "source": [ - "STAGE = 6\n", - "\n", - "input_folder = output_fuzzy_dedupe_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_embeddings_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "c97545f4", - "metadata": { - "id": "c97545f4" - }, - "source": [ - "### 8.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "outputId": "f4c2cba4-aed0-4eee-873b-d1a8abf60cbd" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "18:40:39 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "18:40:39 INFO - 
pipeline id pipeline_id\n", - "18:40:39 INFO - code location None\n", - "18:40:39 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", - "18:40:39 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:39 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:40:39 INFO - orchestrator text_encoder started at 2024-09-18 18:40:39\n", - "18:40:39 INFO - Number of files is 2, source profile {'max_file_size': 0.009204864501953125, 'min_file_size': 0.009014129638671875, 'total_file_size': 0.018218994140625}\n", - "18:40:41 INFO - Completed 1 files (50.0%) in 0.003 min\n", - "18:40:41 INFO - Completed 2 files (100.0%) in 0.003 min\n", - "18:40:41 INFO - Done processing 2 files, waiting for flush() completion.\n", - "18:40:41 INFO - done flushing in 0.0 sec\n", - "18:40:41 INFO - Completed execution in 0.032 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:6 completed successfully\n", - "CPU times: user 816 ms, sys: 204 ms, total: 1.02 s\n", - "Wall time: 2.53 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. 
Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # text_encoder\n", - " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", - "}\n", - "\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "b734852c", - "metadata": { - "id": "b734852c" - }, - "source": [ - "### 8.3 - Inspect Generated output\n", - "\n", - "You will see a column called `embeddings` added at the end. This the text content converted into vectors or embeddings. We used the model `sentence-transformers/all-MiniLM-L6-v2`" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "id": "7b1c1d09", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 205 - }, - "id": "7b1c1d09", - "outputId": "86c49244-9f9f-4116-fb17-c27ff6c29bc7" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (6, 18)\n", - "Output data dimensions (rows x columns)= (6, 19)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 earth.pdf 1 0 11 \n", - "3 earth.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "2 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "2 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "1 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "2 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "5 Earth\\nBasic facts about Earth:\\n· Distance fr... 
$.main-text[6] \n", - "\n", - " page_number bbox chunk_id \\\n", - "0 1 [132.87440491, 500.84011841, 477.48345947, 534... 6 \n", - "1 1 [133.2026062, 482.90710449, 237.04431152, 493.... 7 \n", - "2 1 [132.87112427, 588.96014404, 479.40917969, 623... 0 \n", - "3 1 [133.20942688, 570.81555176, 375.57919312, 581... 1 \n", - "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 2 \n", - "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 3 \n", - "\n", - " removed chunk_hash embeddings \n", - "0 [] -1 [0.07728295, 0.024970993, -0.043180738, 0.0580... \n", - "1 [] -1 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", - "2 [] -1 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", - "3 [] 5 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", - "4 [] -1 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", - "5 [] -1 [0.091821924, 0.015197902, 0.07716932, 0.01711... " - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "f5e12630-be6b-4188-a925-77117155617b", - "metadata": { - "id": "f5e12630-be6b-4188-a925-77117155617b" - }, - "source": [ - "## Step-9: Copy output to final output dir" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "outputId": "aa667c65-8421-4d4d-f57e-47ccc4ea41ad" + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "outputId": "31f09b58-7b2d-48bb-9dac-bc0ba9625c01" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "✅ Copied 
output from 'output/06_embeddings_out' --> 'output/output_final'\n" + "✅ Copied output from 'output/05_embeddings_out' --> 'output/output_final'\n" ] } ], @@ -3836,7 +3299,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "dpk-1-basic-022dev1-py312", "language": "python", "name": "python3" }, @@ -3850,27 +3313,26 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.9" + "version": "3.12.7" }, "widgets": { "application/vnd.jupyter.widget-state+json": { - "0a1ed94698ca4e4291c553929e0ca66c": { + "06f9b33494984e4885d5aad813d1d2bc": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", + "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", + "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", - "bar_color": null, "description_width": "" } }, - "2eea7bc810e54eaeb325136352b71e66": { + "1cb3bbf7d724411cbe9831543a4aecc0": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", @@ -3922,46 +3384,7 @@ "width": null } }, - "3077f04af3a9447ab98717bd3131cd8f": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "4f63bfad92b64e7bae18e720376d402d": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "FloatProgressModel", - "state": 
{ - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_709685da1c6c4164bed658357a2191bf", - "max": 7, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_0a1ed94698ca4e4291c553929e0ca66c", - "value": 7 - } - }, - "5dbc6889a9c243c5a922f8cc5f1a704c": { + "553f3c16839a49d79591d0fc4862bed6": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", @@ -4013,7 +3436,7 @@ "width": null } }, - "6957a659451b46dab702c1c62fa9cdd2": { + "7053c9606a414e978636a7e241909504": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HTMLModel", @@ -4028,13 +3451,51 @@ "_view_name": "HTMLView", "description": "", "description_tooltip": null, - "layout": "IPY_MODEL_5dbc6889a9c243c5a922f8cc5f1a704c", + "layout": "IPY_MODEL_1cb3bbf7d724411cbe9831543a4aecc0", "placeholder": "​", - "style": "IPY_MODEL_d6e520e4da004c818031ccfcc3588e5d", - "value": " 7/7 [00:00<00:00, 221.60it/s]" + "style": "IPY_MODEL_06f9b33494984e4885d5aad813d1d2bc", + "value": " 10/10 [00:00<00:00, 349.38it/s]" + } + }, + "724778729161445c98b187031ae4f67c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "97b603697cfa4b4ea4e6735b6768ca35": { + "model_module": "@jupyter-widgets/controls", + 
"model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_e87e8d3262c54cfaaa8768505edacda3", + "IPY_MODEL_b78aa40816e44f7fbebcb24ca68818b3", + "IPY_MODEL_7053c9606a414e978636a7e241909504" + ], + "layout": "IPY_MODEL_da0787b239764847a731083997780a85" } }, - "709685da1c6c4164bed658357a2191bf": { + "9d184ed175f0403fb03c2e13dfd04e0a": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", @@ -4086,50 +3547,31 @@ "width": null } }, - "7616f1b493e1461c9fd1319fae3bc10b": { + "b78aa40816e44f7fbebcb24ca68818b3": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", - "model_name": "HTMLModel", + "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", + "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", - "_view_name": "HTMLView", + "_view_name": "ProgressView", + "bar_style": "success", "description": "", "description_tooltip": null, - "layout": "IPY_MODEL_ebc626c0750c470db6789b26acf15f60", - "placeholder": "​", - "style": "IPY_MODEL_3077f04af3a9447ab98717bd3131cd8f", - "value": "Fetching 7 files: 100%" - } - }, - "8226b2522ce446f6bd3a36c4e227370c": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": 
"@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_7616f1b493e1461c9fd1319fae3bc10b", - "IPY_MODEL_4f63bfad92b64e7bae18e720376d402d", - "IPY_MODEL_6957a659451b46dab702c1c62fa9cdd2" - ], - "layout": "IPY_MODEL_2eea7bc810e54eaeb325136352b71e66" + "layout": "IPY_MODEL_9d184ed175f0403fb03c2e13dfd04e0a", + "max": 10, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_724778729161445c98b187031ae4f67c", + "value": 10 } }, - "d6e520e4da004c818031ccfcc3588e5d": { + "c0eb5bc8f6ee427ca42204b3c56f9a4e": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", @@ -4144,7 +3586,7 @@ "description_width": "" } }, - "ebc626c0750c470db6789b26acf15f60": { + "da0787b239764847a731083997780a85": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", @@ -4195,6 +3637,27 @@ "visibility": null, "width": null } + }, + "e87e8d3262c54cfaaa8768505edacda3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_553f3c16839a49d79591d0fc4862bed6", + "placeholder": "​", + "style": "IPY_MODEL_c0eb5bc8f6ee427ca42204b3c56f9a4e", + "value": "Fetching 10 files: 100%" + } } } } diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index 7ce746c67..6a14dedc7 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -13,7 +13,8 @@ "\n", "Here is the workflow\n", "\n", - 
"![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" + "![](https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n", + "\n" ] }, { @@ -27,7 +28,7 @@ "\n", "Two options:\n", "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_ray.ipynb)\n", + "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", "\n", "The notebook will work as in both environments" @@ -42,30 +43,10 @@ "source": [ "## Step-1: Inspect the Data\n", "\n", - "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/solar-system)\n", - "\n", - "- [earth.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/mars.pdf)\n", - "\n", - "### (Optional) How to create PDFs?\n", - "\n", - "If you like to play around with various inputs files, follow these steps to re-generate PDFs.\n", - "\n", - "**Option 1 (Easiest): Use a word editor or google docs editor**\n", - "\n", - "Write your content and export as PDF\n", - "\n", - "\n", - "**Option 2: markdown -> pdf**\n", - "\n", - "First edit the markdown files using any text editor.\n", - "\n", - "Then use [pandoc](https://pandoc.org/) to convert them to pdfs.\n", + "We will use simple PDFs about Solar system. The files are [here](https://github.com/sujee/data-prep-kit/tree/main/examples/notebooks/intro/input/solar-system)\n", "\n", - "```bash\n", - "pandoc earth.md -o earth.pdf\n", - "pandoc mars.md -o mars.pdf\n", - "```\n" + "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/mars.pdf)\n" ] }, { @@ -87,11 +68,7 @@ "execution_count": 1, "id": "1fe354b7", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "1fe354b7", - "outputId": "6fe04a4c-8092-49bb-f4ee-ffdcd42b6c11" + "id": "1fe354b7" }, "outputs": [ { @@ -128,19 +105,15 @@ "execution_count": 2, "id": "3309799e", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "3309799e", - "outputId": "5af8cfbc-346d-41bd-c14e-c917d0f403f3" + "id": "3309799e" }, "outputs": [], "source": [ "if RUNNING_IN_COLAB:\n", - " !mkdir -p 'input'\n", - " !wget -O 'input/earth.pdf'
'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/earth.pdf'\n", - " !wget -O 'input/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/mars.pdf'\n", - " !wget -O 'utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/my_utils.py'" + " !mkdir -p 'input/solar-system'\n", + " !wget -O 'input/solar-system/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf'\n", + " !wget -O 'input/solar-system/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf'\n", + " !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/my_utils.py'" ] }, { @@ -158,12 +131,7 @@ "execution_count": 3, "id": "1fcec577", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "1fcec577", - "outputId": "93aa2df3-0cf5-4b04-84bb-6803bbf46df6" + "id": "1fcec577" }, "outputs": [], "source": [ @@ -219,7 +187,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e4YMZrBuFycl", - "outputId": "8a316776-582c-4d01-80de-cd530081a080" + "outputId": "54e232da-b2a8-4f3e-d983-94259505dad3" }, "outputs": [ { @@ -250,7 +218,7 @@ "base_uri": "https://localhost:8080/" }, "id": "33345487", - "outputId": "47dca359-2740-493d-83eb-1291617d3db1" + "outputId": "c14c3a3d-c074-4535-b75d-19c5effa7d94" }, "outputs": [ { @@ -272,10 +240,8 @@ "\n", "MY_CONFIG = MyConfig ()\n", "\n", - "if RUNNING_IN_COLAB:\n", - " MY_CONFIG.INPUT_DATA_DIR = 'input'\n", - "else:\n", - " MY_CONFIG.INPUT_DATA_DIR = os.path.join (os.path.abspath (''), '..', 'data', 'solar-system')\n", + "MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'\n", + "\n", "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", "MY_CONFIG.OUTPUT_FOLDER_FINAL = 
os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", "\n", @@ -339,7 +305,7 @@ "base_uri": "https://localhost:8080/" }, "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "outputId": "704d5f45-5d49-43b0-afeb-1dddf2aa326d" + "outputId": "fd42f265-445f-488c-8c62-b293424f162d" }, "outputs": [ { @@ -404,14 +370,14 @@ "base_uri": "https://localhost:8080/" }, "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "outputId": "5ef25857-46d4-463e-f847-369d18cb2d8d" + "outputId": "f4c02b6f-effd-4d04-8547-f270f721f8d2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "🏃🏼 STAGE-1: Processing input='/home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system' --> output='output/01_parquet_out'\n" + "🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'\n" ] } ], @@ -443,38 +409,38 @@ "base_uri": "https://localhost:8080/" }, "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "outputId": "7a069b9a-1159-4993-d2b0-b26b16235f6b" + "outputId": "2cb0721a-1526-4129-a72f-77c1beefafdb" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:49:32 INFO - Running locally\n", - "18:49:32 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "18:49:32 INFO - data factory data_ is using local data access: input_folder - /home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system output_folder - output/01_parquet_out\n", - "18:49:32 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:49:32 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "18:49:32 INFO - pipeline id pipeline_id\n", - "18:49:32 INFO - code location None\n", - "18:49:32 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': 
-1}\n", - "18:49:32 INFO - actor creation delay 0\n", - "18:49:32 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:49:33,959\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - orchestrator started at 2024-09-18 18:49:37\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.135861206799746, 'object_store': 4.06793060246855}\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:37 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(RayTransformFileProcessor pid=1212179)\u001b[0m 18:49:40 INFO - Initializing models\n", - "Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 167772.16it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=1212180)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:46 INFO - Completed processing 2 files in 0.14 min\n", - "\u001b[36m(orchestrate pid=1211297)\u001b[0m 18:49:46 INFO - done flushing in 0.001 sec\n", - "\u001b[36m(RayTransformFileProcessor pid=1212180)\u001b[0m 18:49:40 INFO - Initializing models\n", - "18:49:56 INFO - Completed execution in 0.4 min, execution result 0\n", - "Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 38031.25it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=1212179)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n" + "22:45:46 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "22:45:46 INFO - pipeline id pipeline_id\n", + "22:45:46 INFO - code location None\n", + "22:45:46 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", + "22:45:46 INFO - actor creation delay 0\n", + "22:45:46 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:45:46 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", + "22:45:46 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:45:46 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "22:45:46 INFO - Running locally\n", + "2024-10-16 22:45:48,783\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - orchestrator started at 2024-10-16 22:45:52\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.14609298761934, 'object_store': 3.073046493344009}\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(RayTransformFileProcessor pid=1001895)\u001b[0m 22:45:55 INFO - Initializing models\n", + "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 103563.06it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=1001895)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:00 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:02 INFO - Completed processing 2 files in 0.033 min\n", + "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:02 INFO - done flushing in 0.001 sec\n", + "\u001b[36m(RayTransformFileProcessor pid=1001896)\u001b[0m 22:45:55 INFO - Initializing models\n", + "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 126716.13it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=1001896)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n", + "22:46:12 INFO - Completed execution in 0.43 min, execution result 0\n" ] }, { @@ -482,8 +448,8 @@ "output_type": "stream", "text": [ "✅ Stage:1 completed successfully\n", - "CPU times: user 4.1 s, sys: 1.17 s, total: 5.27 s\n", - "Wall time: 28.2 s\n" + "CPU times: user 4.46 s, sys: 1.22 s, total: 5.69 s\n", + "Wall time: 30.4 s\n" ] } ], @@ -559,10 +525,10 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 254 + "height": 255 }, "id": "fe59563d", - "outputId": "9ba799f3-a183-4467-d50f-44dbbc86d19a" + "outputId": "40c31bad-d00a-4da9-8169-9db1bcc47704" }, "outputs": [ { @@ -615,12 +581,12 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", " \n", " \n", @@ -630,12 +596,12 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", " \n", " \n", @@ -648,16 +614,16 @@ "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", "\n", " num_tables num_doc_elements document_id ext \\\n", - "0 0 11 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 0 11 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + "0 0 11 f20aa513-8473-4bf7-a746-a66eb28b722c pdf \n", + "1 0 11 b4c44875-3612-4c5a-b387-2f04c63d1276 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:45.937701 1.966178 earth.pdf " + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.131556 2.001925 earth.pdf " ] }, "execution_count": 10, @@ -708,7 +674,7 @@ "base_uri": "https://localhost:8080/" }, "id": "f870e624", - "outputId": "e759dddf-64ac-4b55-a9bf-d0722620d6ab" + "outputId": "fd259342-158a-4a33-f148-d8462e2f1ca2" }, "outputs": [ { @@ -860,7 +826,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e1a10c2d", - "outputId": "d9eab8cc-79ac-4f5e-99f3-596e357a2e39" + "outputId": "68cdc0c0-3bf5-45a2-d2bc-99aa79e3e0d5" }, "outputs": [ { @@ -1034,7 +1000,7 @@ "base_uri": "https://localhost:8080/" }, "id": "305f00a3", - "outputId": "d680cc28-2d3a-4793-9373-c56635a308c9" + "outputId": "7a800f4b-bc80-452d-c3d6-170e19f3422e" }, "outputs": [ { @@ -1075,32 +1041,32 @@ "base_uri": "https://localhost:8080/" }, "id": "5b7b18d5", - "outputId": "7151d997-74f1-42fd-90a2-0124c6a68c84" + "outputId": "e6f06879-906c-47d0-ef34-b018e4efa00f" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:49:58 INFO - Running locally\n", - "18:49:58 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'output_chunk_column_name': 'contents', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", - "18:49:58 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "18:49:58 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:49:58 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:49:58 INFO - pipeline id pipeline_id\n", - "18:49:58 INFO - code location None\n", - "18:49:58 INFO - number 
of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "18:49:58 INFO - actor creation delay 0\n", - "18:49:58 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:50:00,178\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - orchestrator started at 2024-09-18 18:50:02\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.085193634033203, 'object_store': 4.042596817016602}\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:02 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:04 INFO - Completed processing 2 files in 0.033 min\n", - "\u001b[36m(orchestrate pid=1213075)\u001b[0m 18:50:04 INFO - done flushing in 0.001 sec\n", - "18:50:14 INFO - Completed execution in 0.271 min, execution result 0\n" + "22:46:15 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30}\n", + "22:46:15 INFO - pipeline id pipeline_id\n", + "22:46:15 INFO - code location None\n", + "22:46:15 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "22:46:15 INFO - actor creation delay 0\n", + "22:46:15 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:46:15 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "22:46:15 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:46:15 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:46:15 INFO - Running locally\n", + "2024-10-16 22:46:16,484\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - orchestrator started at 2024-10-16 22:46:19\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.136235047131777, 'object_store': 3.068117522634566}\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - Completed processing 2 files in 0.0 min\n", + "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - done flushing in 0.001 sec\n", + "22:46:31 INFO - Completed execution in 0.271 min, execution result 0\n" ] }, { @@ -1108,8 +1074,8 @@ "output_type": "stream", "text": [ "✅ Stage:2 completed successfully\n", - "CPU times: user 917 ms, sys: 285 ms, total: 1.2 s\n", - "Wall time: 18.6 s\n" + "CPU times: user 1.04 s, sys: 360 ms, total: 1.4 s\n", + "Wall time: 19.1 s\n" ] } ], @@ -1171,10 +1137,10 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 893 + "height": 897 }, "id": "d8138d43", - "outputId": "3cbc98f8-1dcb-4a32-9259-f801a83cf241" + "outputId": "3e040b55-8c94-4f97-fedf-d2dbead55a72" }, "outputs": [ { @@ -1184,7 +1150,7 @@ "Files processed : 2\n", "Chunks created : 8\n", "Input data dimensions (rows x columns)= (2, 12)\n", - "Output data dimensions (rows x columns)= (8, 15)\n" + "Output data dimensions (rows x columns)= (8, 16)\n" ] }, { @@ -1212,17 +1178,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " 
document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " \n", " \n", " \n", @@ -1232,17 +1199,18 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " \n", " \n", " 1\n", @@ -1250,17 +1218,18 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " \n", " \n", " 2\n", @@ -1268,17 +1237,18 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", " \n", " \n", " 3\n", @@ -1286,17 +1256,18 @@ " 1\n", " 0\n", " 11\n", - " 
528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", " \n", " \n", " 4\n", @@ -1304,17 +1275,18 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " \n", " \n", " 5\n", @@ -1322,17 +1294,18 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " \n", " \n", " 6\n", @@ -1340,17 +1313,18 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nEarth is the third planet from 
the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", " \n", " \n", " 7\n", @@ -1358,42 +1332,33 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", " \n", " \n", "\n", "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "7 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "7 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -1406,14 +1371,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "7 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "7 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "3 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "7 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -1425,15 +1400,25 @@ "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", "\n", - " page_number bbox \n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... 
\n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... " + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
" ] }, "execution_count": 15, @@ -1481,7 +1466,7 @@ "height": 300 }, "id": "3090c950", - "outputId": "fa82f54b-53a3-4447-a4ca-2fe92dea452a" + "outputId": "4c3b6461-ae8c-41d9-8c71-e1bbe634b9ed" }, "outputs": [ { @@ -1584,7 +1569,7 @@ "base_uri": "https://localhost:8080/" }, "id": "d5f151ae", - "outputId": "87a8d7a0-0bc0-4735-9edb-57e9c9e5a8e1" + "outputId": "3dc3ec5d-31d7-4081-db16-8bb6051ea80a" }, "outputs": [ { @@ -1644,7 +1629,9 @@ { "cell_type": "markdown", "id": "20217298", - "metadata": {}, + "metadata": { + "id": "20217298" + }, "source": [ "## Step-5: DOC ID generation\n", "\n", @@ -1659,7 +1646,9 @@ { "cell_type": "markdown", "id": "66811f5b", - "metadata": {}, + "metadata": { + "id": "66811f5b" + }, "source": [ "### 5.1 - Set Input/output Folder" ] @@ -1668,7 +1657,13 @@ "cell_type": "code", "execution_count": 18, "id": "1f747c0d", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1f747c0d", + "outputId": "765daa01-138b-4bfa-a75c-bffc80f9e246" + }, "outputs": [ { "name": "stdout", @@ -1696,7 +1691,9 @@ { "cell_type": "markdown", "id": "18aa0fe1", - "metadata": {}, + "metadata": { + "id": "18aa0fe1" + }, "source": [ "### 5.2 - Execute" ] @@ -1705,31 +1702,38 @@ "cell_type": "code", "execution_count": 19, "id": "f6e9e145", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 883 + }, + "id": "f6e9e145", + "outputId": "fe3d0a3d-0575-4dd8-8564-e336a6ddb68d" + }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:50:16 INFO - Running locally\n", - "18:50:16 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "18:50:16 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "18:50:16 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:50:16 INFO - data factory data_ Not using 
data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:50:16 INFO - pipeline id pipeline_id\n", - "18:50:16 INFO - code location None\n", - "18:50:16 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "18:50:16 INFO - actor creation delay 0\n", - "18:50:16 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:50:17,977\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - orchestrator started at 2024-09-18 18:50:19\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Number of files is 2, source profile {'max_file_size': 0.008135795593261719, 'min_file_size': 0.008058547973632812, 'total_file_size': 0.01619434356689453}\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.074102020822465, 'object_store': 4.037051009945571}\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - Completed processing 2 files in 0.013 min\n", - "\u001b[36m(orchestrate pid=1214633)\u001b[0m 18:50:19 INFO - done flushing in 0.001 sec\n", - "18:50:29 INFO - Completed execution in 0.231 min, execution result 0\n" + "22:46:32 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "22:46:32 INFO - pipeline id pipeline_id\n", + "22:46:32 INFO - code location None\n", + "22:46:32 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "22:46:32 INFO - actor creation delay 0\n", + "22:46:32 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:46:32 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "22:46:32 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:46:32 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:46:32 INFO - Running locally\n", + "2024-10-16 22:46:33,897\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - orchestrator started at 2024-10-16 22:46:35\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.126107025891542, 'object_store': 3.0630535120144486}\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - Completed processing 2 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - done flushing in 0.001 sec\n", + "22:46:46 INFO - Completed execution in 0.227 min, execution result 0\n" ] }, { @@ -1737,8 +1741,8 @@ "output_type": "stream", "text": [ "✅ Stage:3 completed successfully\n", - "CPU times: user 107 ms, sys: 137 ms, total: 244 ms\n", - "Wall time: 15.1 s\n" + "CPU times: user 122 ms, sys: 153 ms, total: 276 ms\n", + "Wall time: 14.9 s\n" ] } ], @@ -1783,7 +1787,9 @@ { "cell_type": "markdown", "id": "4954402f", - "metadata": {}, + "metadata": { + "id": "4954402f" + }, "source": [ "### 5.3 - Inspect Generated output\n", "\n", @@ -1799,14 +1805,21 @@ "cell_type": "code", "execution_count": 20, "id": "1911179a", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 373 + }, + "id": "1911179a", + "outputId": "b82445e8-ebba-48fa-b1c2-26a9e0743ef9" + }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (8, 15)\n", - 
"Output data dimensions (rows x columns)= (8, 17)\n" + "Input data dimensions (rows x columns)= (8, 16)\n", + "Output data dimensions (rows x columns)= (8, 18)\n" ] }, { @@ -1834,17 +1847,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_hash\n", " chunk_id\n", " \n", @@ -1856,19 +1870,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", - " 0\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 4\n", " \n", " \n", " 1\n", @@ -1876,19 +1891,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", - " 1\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " 5\n", " \n", " \n", " 2\n", @@ -1896,19 +1912,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 
2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", - " 2\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " 6\n", " \n", " \n", " 3\n", @@ -1916,19 +1933,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", - " 3\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7\n", " \n", " \n", " 4\n", @@ -1936,19 +1954,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", - " 4\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 0\n", " \n", " \n", " 5\n", @@ -1956,19 +1975,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " 
b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", - " 5\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " 1\n", " \n", " \n", " 6\n", @@ -1976,19 +1996,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", - " 6\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 2\n", " \n", " \n", " 7\n", @@ -1996,44 +2017,35 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", - " 7\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 3\n", " \n", " \n", "\n", "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "7 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 
528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "7 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -2046,14 +2058,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "7 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "7 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "3 
f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "7 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -2075,15 +2097,25 @@ "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", + " document_id \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", " chunk_hash chunk_id \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 1 \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 2 \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 3 \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 5 \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 6 \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 7 " + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 
2 \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 " ] }, "execution_count": 20, @@ -2105,7 +2137,9 @@ { "cell_type": "markdown", "id": "852829dc", - "metadata": {}, + "metadata": { + "id": "852829dc" + }, "source": [ "## Step-6: Exact Dedup\n", "\n" @@ -2126,11 +2160,7 @@ "execution_count": 21, "id": "4c7a1b94", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "4c7a1b94", - "outputId": "7998935d-3f72-4617-ea03-fd2a40ad9f23" + "id": "4c7a1b94" }, "outputs": [ { @@ -2167,36 +2197,32 @@ "execution_count": 22, "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "outputId": "aa460fea-a393-47d3-b084-59d47f26f0a7" + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:50:31 INFO - Running locally\n", - "18:50:31 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", - "18:50:31 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "18:50:31 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:50:31 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:50:31 INFO - pipeline id pipeline_id\n", - "18:50:31 INFO - code location None\n", - "18:50:31 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "18:50:31 INFO - actor creation delay 0\n", - "18:50:31 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:50:33,176\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - orchestrator started at 2024-09-18 18:50:34\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.064273834228516, 'object_store': 4.032136917114258}\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:34 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:35 INFO - Completed processing 2 files in 0.014 min\n", - "\u001b[36m(orchestrate pid=1216179)\u001b[0m 18:50:35 INFO - done flushing in 0.001 sec\n", - "18:50:45 INFO - Completed execution in 0.23 min, execution result 0\n" + "22:46:47 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", + "22:46:47 INFO - pipeline id pipeline_id\n", + "22:46:47 INFO - code location None\n", + "22:46:47 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "22:46:47 INFO - actor creation delay 0\n", + "22:46:47 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:46:47 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "22:46:47 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:46:47 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, 
random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:46:47 INFO - Running locally\n", + "2024-10-16 22:46:48,851\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - orchestrator started at 2024-10-16 22:46:50\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.11034622322768, 'object_store': 3.055173110216856}\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - Completed processing 2 files in 0.003 min\n", + "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - done flushing in 0.001 sec\n", + "22:47:01 INFO - Completed execution in 0.226 min, execution result 0\n" ] }, { @@ -2204,8 +2230,8 @@ "output_type": "stream", "text": [ "✅ Stage:4 completed successfully\n", - "CPU times: user 99.9 ms, sys: 168 ms, total: 268 ms\n", - "Wall time: 15.1 s\n" + "CPU times: user 125 ms, sys: 134 ms, total: 259 ms\n", + "Wall time: 15 s\n" ] } ], @@ -2266,20 +2292,15 @@ "execution_count": 23, "id": "d824ebf6", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 358 - }, - "id": "d824ebf6", - "outputId": "89f1013d-6dcf-418f-a0d7-5f78b19b74ac" + "id": "d824ebf6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (8, 17)\n", - "Output data dimensions (rows x columns)= (7, 18)\n", + "Input data dimensions (rows x columns)= (8, 18)\n", + "Output data dimensions (rows x columns)= (7, 19)\n", "Input chunks before exact dedupe : 8\n", "Output chunks after exact dedupe : 7\n", "Duplicate chunks removed : 1\n" @@ -2310,17 +2331,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_hash\n", " chunk_id\n", " removed\n", @@ -2333,19 +2355,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nOur solar system is a vast and f...\n", " 
$.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", - " 0\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 4\n", " []\n", " \n", " \n", @@ -2354,19 +2377,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", - " 1\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " 5\n", " []\n", " \n", " \n", @@ -2375,19 +2399,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", - " 2\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " 6\n", " []\n", " \n", " \n", @@ -2396,19 +2421,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", " 
7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", - " 3\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7\n", " []\n", " \n", " \n", @@ -2417,19 +2443,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", - " 5\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " 1\n", " [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...\n", " \n", " \n", @@ -2438,19 +2465,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", - " 6\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 2\n", " []\n", " \n", " \n", @@ -2459,19 +2487,20 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", " 
189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", - " 7\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 3\n", " []\n", " \n", " \n", @@ -2479,23 +2508,14 @@ "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "6 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -2507,13 +2527,22 @@ "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "6 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "3 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -2533,14 +2562,23 @@ "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", + " document_id \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
\n", + "\n", " chunk_hash chunk_id \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 1 \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 2 \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 3 \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 5 \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 6 \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 7 \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", "\n", " removed \n", "0 [] \n", @@ -2576,12 +2614,7 @@ "execution_count": 24, "id": "82cc9bb0", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 112 - }, - "id": "82cc9bb0", - "outputId": "293489a5-a840-4d5c-fafd-245db30d81c0" + "id": "82cc9bb0" }, "outputs": [ { @@ -2674,11 +2707,7 @@ "execution_count": 25, "id": "cc61dffa", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "cc61dffa", - "outputId": "cf6393e6-c4c7-4606-87e5-892c26b28801" + "id": "cc61dffa" }, "outputs": [ { @@ -2781,11 +2810,7 @@ "execution_count": 26, "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", - "outputId": "4548fff6-f86f-45d4-a812-49aa061fdef2" + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399" }, "outputs": [ { @@ -2824,60 +2849,56 @@ "execution_count": 27, "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", - 
"outputId": "1164345a-93db-4f8e-ad34-58a1c3d0c116" + "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:50:46 INFO - Running locally\n", - "18:50:46 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 1}}\n", - "18:50:46 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n", - "18:50:46 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:50:46 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:50:46 INFO - pipeline id pipeline_id\n", - "18:50:46 INFO - code location None\n", - "18:50:46 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "18:50:46 INFO - actor creation delay 0\n", - "18:50:46 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:50:48,381\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - orchestrator started at 2024-09-18 18:50:49\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.067702485248446, 'object_store': 4.033851241692901}\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - starting run from the beginning\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - continuing from the very beginning\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Fuzzy: num buckets 8, bucket length 8\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 bucket actors\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 minhash actors\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - Table preprocessing uses 1 readers\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:49 INFO - created 1 table processor actors\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:57 INFO - Completed 1 files in 0.131 min\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:50:57 INFO - Completed 1 files (50.0%) in 0.131 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:02 INFO - Completed processing 2 files in 0.215 min\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:02 INFO - creating minhash snapshots\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:03 INFO - minhash snapshots created\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:03 INFO - creating bucket snapshots\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - bucket snapshots created\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created 1 document actors\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created 1 bucket processor actors\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - created bucket processor invoker\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:04 INFO - added invoker to bucket collectors\n", - "\u001b[36m(BucketsHash pid=1218636)\u001b[0m 18:51:04 INFO - processing buckets 0 long, 53 short\n", - "\u001b[36m(BucketsHash pid=1218636)\u001b[0m 18:51:04 INFO - Done submitting long buckets\n", - "\u001b[36m(BucketsHashProcessorInvoker pid=1219171)\u001b[0m 18:51:05 INFO - Waiting bucket processing completion. Submitted requests 1\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:05 INFO - Done processing buckets in 0.011 min\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:05 INFO - creating document snapshots\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:06 INFO - document snapshots created\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:06 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:12 INFO - Completed processing 2 files in 0.098 min\n", - "\u001b[36m(orchestrate pid=1217793)\u001b[0m 18:51:12 INFO - done flushing in 0.001 sec\n", - "18:51:22 INFO - Completed execution in 0.592 min, execution result 0\n" + "22:47:02 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 1}}\n", + "22:47:02 INFO - pipeline id pipeline_id\n", + "22:47:02 INFO - code location None\n", + "22:47:02 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "22:47:02 INFO - actor creation delay 0\n", + "22:47:02 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:47:02 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n", + "22:47:02 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:47:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:47:02 INFO - Running locally\n", + "2024-10-16 22:47:03,977\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - orchestrator started at 2024-10-16 22:47:05\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.128299713134766, 'object_store': 3.064149856567383}\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - starting run from the beginning\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - continuing from the very beginning\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Fuzzy: num buckets 8, bucket length 8\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - created 1 bucket actors\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - created 1 minhash actors\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Table preprocessing uses 1 readers\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:06 INFO - created 1 table processor actors\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:12 INFO - Completed 1 files in 0.104 min\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:12 INFO - Completed 1 files (50.0%) in 0.104 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:15 INFO - Completed processing 2 files in 0.154 min\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:15 INFO - creating minhash snapshots\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:16 INFO - minhash snapshots created\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:16 INFO - creating bucket snapshots\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:17 INFO - bucket snapshots created\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:17 INFO - created 1 document actors\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - created 1 bucket processor actors\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - created bucket processor invoker\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - added invoker to bucket collectors\n", + "\u001b[36m(BucketsHash pid=1008361)\u001b[0m 22:47:18 INFO - processing buckets 0 long, 53 short\n", + "\u001b[36m(BucketsHash pid=1008361)\u001b[0m 22:47:18 INFO - Done submitting long buckets\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:19 INFO - Done processing buckets in 0.012 min\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:19 INFO - creating document snapshots\n", + "\u001b[36m(BucketsHashProcessorInvoker pid=1008950)\u001b[0m 22:47:19 INFO - Waiting bucket processing completion. Submitted requests 1\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:20 INFO - document snapshots created\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:21 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:30 INFO - Completed processing 2 files in 0.153 min\n", + "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:30 INFO - done flushing in 0.001 sec\n", + "22:47:40 INFO - Completed execution in 0.632 min, execution result 0\n" ] }, { @@ -2885,8 +2906,8 @@ "output_type": "stream", "text": [ "✅ Stage:5 completed successfully\n", - "CPU times: user 174 ms, sys: 166 ms, total: 341 ms\n", - "Wall time: 36.7 s\n" + "CPU times: user 212 ms, sys: 201 ms, total: 413 ms\n", + "Wall time: 39.4 s\n" ] } ], @@ -2965,20 +2986,15 @@ "execution_count": 28, "id": "e899ad60", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 222 - }, - "id": "e899ad60", - "outputId": "70d040ab-b1d5-4797-f725-11982ef82413" + "id": "e899ad60" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (8, 17)\n", - "Output data dimensions (rows x columns)= (6, 17)\n", + "Input data dimensions (rows x columns)= (8, 18)\n", + "Output data dimensions (rows x columns)= (6, 18)\n", "Duplicate chunks removed by fuzzy-dedupe: 2\n" ] }, @@ -3007,17 +3023,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_id\n", " chunk_hash\n", " \n", @@ -3029,19 +3046,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", - " 
0\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 4\n", + " -1\n", " \n", " \n", " 1\n", @@ -3049,19 +3067,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", - " Solar System\\nFor more details about the Solar...\n", - " $.main-text[3]\n", - " 1\n", - " [133.18510437, 570.83258057, 374.99838257, 581...\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " Mars\\nMars, the fourth planet from the Sun, is...\n", + " $.main-text[5]\n", " 1\n", - " 5\n", + " [132.87440491, 500.84011841, 477.48345947, 534...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " 6\n", + " -1\n", " \n", " \n", " 2\n", @@ -3069,39 +3088,41 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", - " Mars\\nMars, the fourth planet from the Sun, is...\n", - " $.main-text[5]\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " Basic facts about Mars:\\n· Distance from the S...\n", + " $.main-text[6]\n", " 1\n", - " [132.87440491, 500.84011841, 477.48345947, 534...\n", - " 2\n", + " [133.2026062, 482.90710449, 237.04431152, 493....\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7\n", " -1\n", " \n", " \n", " 3\n", - " mars.pdf\n", + " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", - " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", - " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", - " mars.pdf\n", - " Basic facts about Mars:\\n· Distance from the S...\n", - " $.main-text[6]\n", + " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", + " 2686\n", + " 
2024-10-16T22:46:02.131556\n", + " 2.001925\n", + " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " Solar System\\nFor more details about our Solar...\n", + " $.main-text[3]\n", " 1\n", - " [133.2026062, 482.90710449, 237.04431152, 493....\n", - " 3\n", - " -1\n", + " [133.20942688, 570.81555176, 375.57919312, 581...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " 1\n", + " 5\n", " \n", " \n", " 4\n", @@ -3109,18 +3130,19 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", - " 6\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 2\n", " -1\n", " \n", " \n", @@ -3129,18 +3151,19 @@ " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", - " 7\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 3\n", " -1\n", " \n", " \n", @@ -3148,61 +3171,61 @@ "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - 
"2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "3 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "3 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", "\n", " contents doc_jsonpath 
\\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", "5 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", "\n", - " page_number bbox chunk_id \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... 0 \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... 1 \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... 2 \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... 3 \n", - "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 6 \n", - "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 7 \n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", - " chunk_hash \n", - "0 4 \n", - "1 5 \n", - "2 -1 \n", - "3 -1 \n", - "4 -1 \n", - "5 -1 " + " document_id chunk_id chunk_hash \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 -1 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 -1 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 -1 \n", + "3 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 5 \n", + "4 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 
2 -1 \n", + "5 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 -1 " ] }, "execution_count": 28, @@ -3227,12 +3250,7 @@ "execution_count": 29, "id": "ab7ea52b", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 81 - }, - "id": "ab7ea52b", - "outputId": "13a1847a-bdd1-4dc9-a281-a8faac59c3a8" + "id": "ab7ea52b" }, "outputs": [ { @@ -3269,17 +3287,17 @@ " \n", " 1\n", " mars.pdf\n", - " Solar System\\nFor more details about the Solar...\n", + " Mars\\nMars, the fourth planet from the Sun, is...\n", " \n", " \n", " 2\n", " mars.pdf\n", - " Mars\\nMars, the fourth planet from the Sun, is...\n", + " Basic facts about Mars:\\n· Distance from the S...\n", " \n", " \n", " 3\n", - " mars.pdf\n", - " Basic facts about Mars:\\n· Distance from the S...\n", + " earth.pdf\n", + " Solar System\\nFor more details about our Solar...\n", " \n", " \n", " 4\n", @@ -3298,9 +3316,9 @@ "text/plain": [ " filename contents\n", "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", - "1 mars.pdf Solar System\\nFor more details about the Solar...\n", - "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "2 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "3 earth.pdf Solar System\\nFor more details about our Solar...\n", "4 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", "5 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." ] @@ -3319,11 +3337,7 @@ "execution_count": 30, "id": "6bdd3515", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "6bdd3515", - "outputId": "5a214fa3-c420-42d7-dcab-574b661e0cd8" + "id": "6bdd3515" }, "outputs": [ { @@ -3336,14 +3350,10 @@ "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", "-------\n", "-------Chunk 1------\n", - "Solar System\n", - "For more details about the Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", "Mars\n", "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", "-------\n", - "-------Chunk 3------\n", + "-------Chunk 2------\n", "Basic facts about Mars:\n", "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", @@ -3351,10 +3361,14 @@ "-------\n", "========== earth.pdf ===========\n", "-------Chunk 0------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 1------\n", "Earth\n", "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", "-------\n", - "-------Chunk 1------\n", + "-------Chunk 2------\n", "Earth\n", "Basic facts about Earth:\n", "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", @@ -3437,11 +3451,7 @@ "execution_count": 31, "id": "20a153fa-fd56-401e-86be-4f7617affcc8", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "outputId": "1c7835d1-1f2c-4545-8533-d9ab7a3ad0aa" + "id": "20a153fa-fd56-401e-86be-4f7617affcc8" }, "outputs": [ { @@ -3478,36 +3488,32 @@ "execution_count": 32, "id": "228df6b2-bc62-494b-9697-03ece98d7853", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "outputId": "91dd893c-3056-4d2a-bffe-49645e584a12" + "id": "228df6b2-bc62-494b-9697-03ece98d7853" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:51:23 INFO - Running locally\n", - "18:51:23 INFO - 
text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "18:51:23 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", - "18:51:23 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:51:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:51:23 INFO - pipeline id pipeline_id\n", - "18:51:23 INFO - code location None\n", - "18:51:23 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "18:51:23 INFO - actor creation delay 0\n", - "18:51:23 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:51:25,784\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - orchestrator started at 2024-09-18 18:51:28\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Number of files is 2, source profile {'max_file_size': 0.008937835693359375, 'min_file_size': 0.00830841064453125, 'total_file_size': 0.017246246337890625}\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.01370926015079, 'object_store': 4.0068546291440725}\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:28 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:33 INFO - Completed processing 2 files in 0.084 min\n", - "\u001b[36m(orchestrate pid=1219965)\u001b[0m 18:51:34 INFO - done flushing in 0.001 sec\n", - "18:51:44 INFO - Completed execution in 0.334 min, execution result 0\n" + "22:47:42 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "22:47:42 INFO - pipeline id pipeline_id\n", + "22:47:42 INFO - code location None\n", + "22:47:42 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", + "22:47:42 INFO - actor creation delay 0\n", + "22:47:42 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", + "22:47:42 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", + "22:47:42 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:47:42 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:47:42 INFO - Running locally\n", + "2024-10-16 22:47:44,003\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - orchestrator started at 2024-10-16 22:47:47\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Number of files is 2, source profile {'max_file_size': 0.009654045104980469, 'min_file_size': 0.00907135009765625, 'total_file_size': 0.01872539520263672}\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.101744843646884, 'object_store': 3.0508724208921194}\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - Completed processing 2 files in 0.011 min\n", + "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - done flushing in 0.001 sec\n", + "22:48:03 INFO - Completed execution in 0.349 min, execution result 0\n" ] }, { @@ -3515,8 +3521,8 @@ "output_type": "stream", "text": [ "✅ Stage:6 completed successfully\n", - "CPU times: user 611 ms, sys: 194 ms, total: 805 ms\n", - "Wall time: 22.1 s\n" + "CPU times: user 422 ms, sys: 241 ms, total: 663 ms\n", + "Wall time: 22.9 s\n" ] } ], @@ -3572,20 +3578,15 @@ "execution_count": 33, "id": "7b1c1d09", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 205 - }, - "id": "7b1c1d09", - "outputId": "9e695b9d-f196-4cb7-c56f-3789251e7860" + "id": "7b1c1d09" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (6, 17)\n", - "Output data dimensions (rows x columns)= (6, 18)\n" + "Input data dimensions (rows x columns)= (6, 18)\n", + "Output data dimensions (rows x columns)= (6, 19)\n" ] }, { @@ -3613,17 +3614,18 @@ " num_pages\n", 
" num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_id\n", " chunk_hash\n", " embeddings\n", @@ -3636,19 +3638,20 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", - " 0\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 4\n", + " -1\n", " [0.0077404897, -0.020559434, 0.026426662, 0.01...\n", " \n", " \n", @@ -3657,81 +3660,85 @@ " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", - " mars.pdf\n", - " Solar System\\nFor more details about the Solar...\n", - " $.main-text[3]\n", - " 1\n", - " [133.18510437, 570.83258057, 374.99838257, 581...\n", - " 1\n", - " 5\n", - " [-0.051861413, 0.0035226392, 0.030617053, 0.04...\n", - " \n", - " \n", - " 2\n", - " mars.pdf\n", - " 1\n", - " 0\n", - " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", - " pdf\n", - " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", - " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", - " 2\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " 
6\n", " -1\n", " [0.07728298, 0.024971062, -0.04318075, 0.05809...\n", " \n", " \n", - " 3\n", + " 2\n", " mars.pdf\n", " 1\n", " 0\n", " 11\n", - " 528221ef-005b-4df1-a057-84a012239ed0\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:49:46.009830\n", - " 2.004444\n", + " 2024-10-16T22:46:02.114286\n", + " 1.984612\n", " mars.pdf\n", + " f20aa513-8473-4bf7-a746-a66eb28b722c\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", - " 3\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7\n", " -1\n", " [0.1059802, 0.025460616, 0.02362733, 0.0390564...\n", " \n", " \n", + " 3\n", + " earth.pdf\n", + " 1\n", + " 0\n", + " 11\n", + " pdf\n", + " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", + " 2686\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", + " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " Solar System\\nFor more details about our Solar...\n", + " $.main-text[3]\n", + " 1\n", + " [133.20942688, 570.81555176, 375.57919312, 581...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " 1\n", + " 5\n", + " [-0.062105577, -0.0053322953, 0.03127779, 0.04...\n", + " \n", + " \n", " 4\n", " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " 973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", - " 6\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 2\n", " -1\n", " [0.0724358, -0.058001805, -0.01977186, -0.0243...\n", " \n", @@ -3741,18 +3748,19 @@ " 1\n", " 0\n", " 11\n", - " 
973d284f-30a5-464b-bfb9-28dacd2832f5\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:49:45.937701\n", - " 1.966178\n", + " 2024-10-16T22:46:02.131556\n", + " 2.001925\n", " earth.pdf\n", + " b4c44875-3612-4c5a-b387-2f04c63d1276\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", - " 7\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 3\n", " -1\n", " [0.091821924, 0.015197907, 0.07716932, 0.01711...\n", " \n", @@ -3761,61 +3769,69 @@ "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "1 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "2 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "3 528221ef-005b-4df1-a057-84a012239ed0 pdf \n", - "4 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", - "5 973d284f-30a5-464b-bfb9-28dacd2832f5 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "1 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "2 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "3 2024-09-18T18:49:46.009830 2.004444 mars.pdf \n", - "4 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", - "5 2024-09-18T18:49:45.937701 1.966178 earth.pdf \n", + "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", + "3 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", + "3 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", "5 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", "\n", - " page_number bbox chunk_id \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... 0 \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... 
1 \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... 2 \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... 3 \n", - "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 6 \n", - "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 7 \n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "4 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "5 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", - " chunk_hash embeddings \n", - "0 4 [0.0077404897, -0.020559434, 0.026426662, 0.01... \n", - "1 5 [-0.051861413, 0.0035226392, 0.030617053, 0.04... \n", - "2 -1 [0.07728298, 0.024971062, -0.04318075, 0.05809... \n", - "3 -1 [0.1059802, 0.025460616, 0.02362733, 0.0390564... \n", - "4 -1 [0.0724358, -0.058001805, -0.01977186, -0.0243... \n", - "5 -1 [0.091821924, 0.015197907, 0.07716932, 0.01711... " + " document_id chunk_id chunk_hash \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 -1 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 -1 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 -1 \n", + "3 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 5 \n", + "4 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 -1 \n", + "5 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 -1 \n", + "\n", + " embeddings \n", + "0 [0.0077404897, -0.020559434, 0.026426662, 0.01... \n", + "1 [0.07728298, 0.024971062, -0.04318075, 0.05809... \n", + "2 [0.1059802, 0.025460616, 0.02362733, 0.0390564... \n", + "3 [-0.062105577, -0.0053322953, 0.03127779, 0.04... \n", + "4 [0.0724358, -0.058001805, -0.01977186, -0.0243... \n", + "5 [0.091821924, 0.015197907, 0.07716932, 0.01711... 
" ] }, "execution_count": 33, @@ -3849,11 +3865,7 @@ "execution_count": 34, "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "outputId": "e6a04d78-b8e9-431a-e9f5-1f9ad1aee3a7" + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207" }, "outputs": [ { @@ -3877,7 +3889,9 @@ "cell_type": "code", "execution_count": null, "id": "dc0a6728", - "metadata": {}, + "metadata": { + "id": "dc0a6728" + }, "outputs": [], "source": [] } @@ -3887,7 +3901,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "dpk-1-basic-022dev1-py312", "language": "python", "name": "python3" }, @@ -3901,7 +3915,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.9" + "version": "3.12.7" } }, "nbformat": 4, From 96d680867676672e998b7e3cfcb0abb7c9101452 Mon Sep 17 00:00:00 2001 From: Sujee Maniyam Date: Wed, 16 Oct 2024 23:51:22 -0700 Subject: [PATCH 03/10] Fixing URLs Signed-off-by: Sujee Maniyam --- examples/notebooks/intro/dpk_intro_1_python.ipynb | 6 +++--- examples/notebooks/intro/dpk_intro_1_ray.ipynb | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb index 1049bf8d6..a6b2efff5 100644 --- a/examples/notebooks/intro/dpk_intro_1_python.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_python.ipynb @@ -42,10 +42,10 @@ "source": [ "## Step-1: Inspect the Data\n", "\n", - "We will use simple PDFs about Solar system. The files are [here](https://github.com/sujee/data-prep-kit/tree/main/examples/notebooks/intro/input/solar-system)\n", + "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/intro-example1/examples/notebooks/intro/input/solar-system)\n", "\n", - "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/sujee//blob/main/examples/notebooks/intro/input/solar-system/mars.pdf)\n" + "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf)\n" ] }, { diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index 6a14dedc7..631b79926 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -28,7 +28,7 @@ "\n", "Two options:\n", "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", + "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", "\n", "The notebook will work as in both environments" @@ -43,10 +43,10 @@ "source": [ "## Step-1: Inspect the Data\n", "\n", - "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/main/examples/notebooks/intro/input/solar-system)\n", + "We will use simple PDFs about Solar system. The files are [here](https://github.com/sujee/data-prep-kit/tree/intro-example1/examples/notebooks/intro/input/solar-system)\n", "\n", - "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/sujee//blob/main/examples/notebooks/intro/input/solar-system/mars.pdf)\n" + "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf)\n" ] }, { From 970c22a33a26fb090e737e03c14d343cb0802964 Mon Sep 17 00:00:00 2001 From: Sujee Maniyam Date: Thu, 17 Oct 2024 00:09:41 -0700 Subject: [PATCH 04/10] fix colab url --- examples/notebooks/intro/dpk_intro_1_ray.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index 631b79926..b39e30d2d 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -28,7 +28,7 @@ "\n", "Two options:\n", "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", + "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_ray.ipynb)\n", "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", "\n", "The notebook will work as in both environments" From 469a90eba1c233ef20c427d7c92a36cc41b7db50 Mon Sep 17 00:00:00 2001 From: Sujee Maniyam Date: Fri, 18 Oct 2024 13:36:39 -0700 Subject: [PATCH 05/10] intro examples using DPK release 0.2.1 Signed-off-by: Sujee Maniyam --- examples/notebooks/intro/README.md | 18 +- .../notebooks/intro/dpk_intro_1_python.ipynb | 7308 ++++++++--------- .../notebooks/intro/dpk_intro_1_ray.ipynb | 1324 ++- 3 files changed, 4550 insertions(+), 4100 deletions(-) diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md index 07b63f513..14d56e8e9 100644 --- a/examples/notebooks/intro/README.md +++ b/examples/notebooks/intro/README.md @@ -7,7 +7,23 @@ This is an example featuring some of the features of data prep kit. The code can be run on either 1. Google colab: very easy to run; no local setup needed. -2. On your local Python environment. Please follow the [instructions](../../../README.md#-getting-started) to setup +2. On your local Python environment. Here is a quick guide. You can find instructions for the latest version [here](../../../README.md#-getting-started) + +```bash +conda create -n data-prep-kit -y python=3.11 +conda activate data-prep-kit + +# install the following in 'data-prep-kit' environment +pip3 install data-prep-toolkit-transforms==0.2.1 data-prep-toolkit-transforms-ray==0.2.1 +pip3 install jupyterlab ipykernel ipywidgets + +## install custom kernel +## Important: Use this kernel when running example notebooks! 
+python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit" + +# start jupyter and run the notebooks with this jupyter +jupyter lab +``` ## Intro diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb index a6b2efff5..91bb79060 100644 --- a/examples/notebooks/intro/dpk_intro_1_python.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_python.ipynb @@ -1,3667 +1,3667 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", - "metadata": { - "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" - }, - "source": [ - "# Data Prep Kit Demo 1 - Python version\n", - "\n", - "This notebook will introduce DPK and showcase some of it's capabilities.\n", - "\n", - "Here is the workflow\n", - "\n", - "![](https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" - ] - }, - { - "cell_type": "markdown", - "id": "b15976e3", - "metadata": { - "id": "b15976e3" - }, - "source": [ - "## How to run this notebook\n", - "\n", - "Two options:\n", - "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", - "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", - "\n", - "The notebook will work as in both environments" - ] - }, - { - "cell_type": "markdown", - "id": "eb8b0d5c", - "metadata": { - "id": "eb8b0d5c" - }, - "source": [ - "## Step-1: Inspect the Data\n", - "\n", - "We will use simple PDFs about Solar system. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/intro-example1/examples/notebooks/intro/input/solar-system)\n", - "\n", - "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf)\n" - ] - }, - { - "cell_type": "markdown", - "id": "39a0ab6e", - "metadata": { - "id": "39a0ab6e" - }, - "source": [ - "## Step-2: Figure out Runtime Environment\n", - "\n", - "### 2.1 - Determine runtime\n", - "\n", - "Determine if we are running on Google colab or local python environment" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "1fe354b7", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "1fe354b7", - "outputId": "5c153f72-08ed-4d6e-ccc7-dae851e7fd8b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NOT in Colab\n" - ] - } - ], - "source": [ - "import os\n", - "\n", - "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", - " print(\"Running in Colab\")\n", - " RUNNING_IN_COLAB = True\n", - "else:\n", - " print(\"NOT in Colab\")\n", - " RUNNING_IN_COLAB = False" - ] - }, - { - "cell_type": "markdown", - "id": "8e7c104b", - "metadata": { - "id": "8e7c104b" - }, - "source": [ - "### 2.2 -Download Data if running on Google Colab" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "3309799e", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "3309799e", - "outputId": "99530315-6dd5-405d-dbde-61e2332e441b" - }, - "outputs": [], - "source": [ - "if RUNNING_IN_COLAB:\n", - " !mkdir -p 'input/solar-system'\n", - " !wget -O 'input/solar-system/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf'\n", - " !wget -O 'input/solar-system/mars.pdf' 
'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf'\n", - " !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/my_utils.py'" - ] - }, - { - "cell_type": "markdown", - "id": "a5dc2b68", - "metadata": { - "id": "a5dc2b68" - }, - "source": [ - "### 2.3 - Install dependencies if running on Google Colab" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "1fcec577", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "1fcec577", - "outputId": "0f77fc39-ffeb-48da-ce6f-1750d8d3ad62" - }, - "outputs": [], - "source": [ - "if RUNNING_IN_COLAB:\n", - " ! pip install --default-timeout=100 \\\n", - " data-prep-toolkit[ray]==0.2.2.dev1 \\\n", - " data-prep-toolkit-transforms[ray,all]==0.2.2.dev1 \\\n", - " deepsearch-toolkit\n" - ] - }, - { - "cell_type": "markdown", - "id": "243322b8", - "metadata": { - "id": "243322b8" - }, - "source": [ - "### 2.4 - Restart Runtime\n", - "\n", - "After installing dependencies, be sure restart runtime, so libraries will be loaded\n", - "\n", - "You do this by going to **`Runtime --> Restart Session`**\n", - "\n", - "Then you can continue to the next step (no need to re-run the notebook)" - ] - }, - { - "cell_type": "markdown", - "id": "e8b10be1", - "metadata": { - "id": "e8b10be1" - }, - "source": [ - "## Step-2: Configuration" - ] - }, - { - "cell_type": "markdown", - "id": "356c66f7", - "metadata": { - "id": "356c66f7" - }, - "source": [ - "### 2.1 - Basic Config" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "e4YMZrBuFycl", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "e4YMZrBuFycl", - "outputId": "d7ee9449-4f21-4c9a-fa54-14b7f28d764a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "NOT in Colab\n" - ] - } - ], - "source": [ - "import os\n", - 
"\n", - "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", - " print(\"Running in Colab\")\n", - " RUNNING_IN_COLAB = True\n", - "else:\n", - " print(\"NOT in Colab\")\n", - " RUNNING_IN_COLAB = False" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "33345487", - "metadata": { - "id": "33345487" - }, - "outputs": [], - "source": [ - "import os\n", - "\n", - "## Configuration\n", - "class MyConfig:\n", - " pass\n", - "\n", - "MY_CONFIG = MyConfig ()\n", - "\n", - "MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'\n", - "\n", - "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", - "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", - "\n", - "## Embedding model\n", - "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "b15e6827", - "metadata": { - "id": "b15e6827" - }, - "outputs": [], - "source": [ - "## Add parent dir to path\n", - "import os,sys\n", - "\n", - "this_dir = os.path.abspath('')\n", - "parent_dir = os.path.dirname(this_dir)\n", - "sys.path.append (os.path.abspath (parent_dir))" - ] - }, - { - "cell_type": "markdown", - "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", - "metadata": { - "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" - }, - "source": [ - "### 2.2 - Setup input/outpur directories" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "outputId": "4d5511fb-1c6f-47df-e5ea-2c1b354d262f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Cleared output directory\n" - ] - } - ], - "source": [ - "import os, sys\n", - "import shutil\n", - "\n", - "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", - " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", - 
"\n", - "output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", - "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", - "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", - "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", - "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_embeddings_out')\n", - "\n", - "## clear output folder\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", - "shutil.os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", - "\n", - "print (\"✅ Cleared output directory\")" - ] - }, - { - "cell_type": "markdown", - "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", - "metadata": { - "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" - }, - "source": [ - "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", - "\n", - "This step is reading the input folder containing all PDF files and ingest them in a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", - "The documents are converted into a JSON format which allows to easily chunk it in the later steps.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", - "metadata": { - "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a" - }, - "source": [ - "### 3.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "outputId": "c50847d4-f2c7-4559-f5f7-d6a3d025027d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'\n" - ] - } - ], - "source": [ - "STAGE = 1\n", - "\n", - "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", - "output_folder = output_parquet_dir\n", - "\n", - 
"print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", - "metadata": { - "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" - }, - "source": [ - "### 3.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 657, - "referenced_widgets": [ - "97b603697cfa4b4ea4e6735b6768ca35", - "e87e8d3262c54cfaaa8768505edacda3", - "b78aa40816e44f7fbebcb24ca68818b3", - "7053c9606a414e978636a7e241909504", - "da0787b239764847a731083997780a85", - "553f3c16839a49d79591d0fc4862bed6", - "c0eb5bc8f6ee427ca42204b3c56f9a4e", - "9d184ed175f0403fb03c2e13dfd04e0a", - "724778729161445c98b187031ae4f67c", - "1cb3bbf7d724411cbe9831543a4aecc0", - "06f9b33494984e4885d5aad813d1d2bc" - ] - }, - "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "outputId": "01d207fb-983d-40b2-e5f6-e38e3789110a" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "22:43:02 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "22:43:02 INFO - pipeline id pipeline_id\n", - "22:43:02 INFO - code location None\n", - "22:43:02 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", - "22:43:02 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:43:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "22:43:02 INFO - orchestrator pdf2parquet started at 2024-10-16 22:43:02\n", - "22:43:02 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 
0.11101436614990234}\n", - "22:43:02 INFO - Initializing models\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "e92bbc86f5e34ee4ad7dd853a5136c01", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Fetching 10 files: 0%| | 0/10 [00:00\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filename
0mars.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...101107bc0c9a-f863-48e3-9aed-bd289af040bcpdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf
1earth.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1011e141f7a4-3e45-4f04-88d3-60e0a81b195bpdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdf
\n", - "" - ], - "text/plain": [ - " filename contents num_pages \\\n", - "0 mars.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", - "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", - "\n", - " num_tables num_doc_elements document_id ext \\\n", - "0 0 11 07bc0c9a-f863-48e3-9aed-bd289af040bc pdf \n", - "1 0 11 e141f7a4-3e45-4f04-88d3-60e0a81b195b pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \n", - "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "1 2024-10-16T22:43:07.205350 0.921915 earth.pdf " - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(5)\n", - "\n", - "## To display certain columns\n", - "#parquet_df[['column1', 'column2', 'column3']].head(5)" - ] - }, - { - "cell_type": "markdown", - "id": "e5058a21", - "metadata": { - "id": "e5058a21" - }, - "source": [ - "\n", - "### 3.4 - Understand the output\n", - "\n", - "Here are some interesting attributes to note:\n", - "\n", - "- **filename** : original filename\n", - "- **contents** : text\n", - "- **document_id**: unique id (UUID) assignd to this document\n", - "- **hash** : hash of document\n", - "- **pdf_convert_time** : time to convert this pdf in seconds\n", - "\n", - "Let's inspect the **contents** column. See how the text is being divided up!" 
- ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "f870e624", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "f870e624", - "outputId": "0b4c054f-3a8a-4db3-f32f-17bd1466b102" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'_name': '',\n", - " 'description': {'logs': []},\n", - " 'equations': [],\n", - " 'figures': [],\n", - " 'file-info': {'#-pages': 1,\n", - " 'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',\n", - " 'filename': 'mars.pdf',\n", - " 'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',\n", - " 'model': 'default',\n", - " 'page': 1}]},\n", - " 'footnotes': [],\n", - " 'main-text': [{'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.35137939,\n", - " 654.45184326,\n", - " 169.88169861,\n", - " 667.98492432],\n", - " 'page': 1,\n", - " 'span': [0, 4]}],\n", - " 'text': 'Mars',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.09541321,\n", - " 630.68127441,\n", - " 210.66503906,\n", - " 642.34405518],\n", - " 'page': 1,\n", - " 'span': [0, 12]}],\n", - " 'text': 'Solar System',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.84518433,\n", - " 588.96014404,\n", - " 479.40917969,\n", - " 623.02520752],\n", - " 'page': 1,\n", - " 'span': [0, 205]}],\n", - " 'text': 'Our solar system is a vast and fascinating expanse, '\n", - " 'comprising eight planets, five dwarf planets, '\n", - " 'numerous moons, asteroids, comets, and other '\n", - " 'celestial bodies. 
At its center lies the star we call '\n", - " 'the Sun.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.18510437,\n", - " 570.83258057,\n", - " 374.99838257,\n", - " 581.07043457],\n", - " 'page': 1,\n", - " 'span': [0, 54]}],\n", - " 'text': 'For more details about the Solar system see Chapter '\n", - " '1.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.22866821,\n", - " 542.98168945,\n", - " 163.86282349,\n", - " 554.45288086],\n", - " 'page': 1,\n", - " 'span': [0, 4]}],\n", - " 'text': 'Mars',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.87440491,\n", - " 500.84011841,\n", - " 477.48345947,\n", - " 534.55810547],\n", - " 'page': 1,\n", - " 'span': [0, 196]}],\n", - " 'text': 'Mars, the fourth planet from the Sun, is a cold, '\n", - " 'desert world with a thin atmosphere composed '\n", - " 'primarily of carbon dioxide. Its reddish hue comes '\n", - " 'from iron oxide, or rust, prevalent on its surface.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.2026062,\n", - " 482.90710449,\n", - " 237.04431152,\n", - " 493.07443237],\n", - " 'page': 1,\n", - " 'span': [0, 23]}],\n", - " 'text': 'Basic facts about Mars:',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 453.019104,\n", - " 477.48171997,\n", - " 474.9703064],\n", - " 'page': 1,\n", - " 'span': [0, 78]}],\n", - " 'text': '· Distance from the Sun: Average of 228 million '\n", - " 'kilometers (142 million miles)',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 440.79351807,\n", - " 431.73287964,\n", - " 451.2142334],\n", - " 'page': 1,\n", - " 'span': [0, 64]}],\n", - " 'text': '· Rotation Period: 24.6 hours (one Martian day - '\n", - " 'called a \"sol\")',\n", - " 'type': 'paragraph'},\n", - " 
{'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 429.10913086,\n", - " 365.9559021,\n", - " 438.83737183],\n", - " 'page': 1,\n", - " 'span': [0, 44]}],\n", - " 'text': '· Moons: Two small moons, Phobos and Deimos.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Page-footer',\n", - " 'prov': [{'bbox': [303.13299561,\n", - " 87.20314026,\n", - " 308.11428833,\n", - " 96.51646423],\n", - " 'page': 1,\n", - " 'span': [0, 1]}],\n", - " 'text': '1',\n", - " 'type': 'page-footer'}],\n", - " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", - " 'page-footers': [],\n", - " 'page-headers': [],\n", - " 'tables': [],\n", - " 'type': 'pdf-document'}\n" - ] - } - ], - "source": [ - "import pprint\n", - "import json\n", - "\n", - "pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))\n", - "# json.loads(output_df.iloc[0, ]['contents'])" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "e1a10c2d", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "e1a10c2d", - "outputId": "c1d992c2-faa8-40cd-c375-857970201daa" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'_name': '',\n", - " 'description': {'logs': []},\n", - " 'equations': [],\n", - " 'figures': [],\n", - " 'file-info': {'#-pages': 1,\n", - " 'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',\n", - " 'filename': 'earth.pdf',\n", - " 'page-hashes': [{'hash': 'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',\n", - " 'model': 'default',\n", - " 'page': 1}]},\n", - " 'footnotes': [],\n", - " 'main-text': [{'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.30961609,\n", - " 654.45184326,\n", - " 174.04208374,\n", - " 667.93347168],\n", - " 'page': 1,\n", - " 'span': [0, 5]}],\n", - " 'text': 'Earth',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.12528992,\n", - " 
630.69073486,\n", - " 210.66503906,\n", - " 642.27935791],\n", - " 'page': 1,\n", - " 'span': [0, 12]}],\n", - " 'text': 'Solar System',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.87112427,\n", - " 588.96014404,\n", - " 479.40917969,\n", - " 623.04595947],\n", - " 'page': 1,\n", - " 'span': [0, 205]}],\n", - " 'text': 'Our solar system is a vast and fascinating expanse, '\n", - " 'comprising eight planets, five dwarf planets, '\n", - " 'numerous moons, asteroids, comets, and other '\n", - " 'celestial bodies. At its center lies the star we call '\n", - " 'the Sun.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.20942688,\n", - " 570.81555176,\n", - " 375.57919312,\n", - " 581.08459473],\n", - " 'page': 1,\n", - " 'span': [0, 54]}],\n", - " 'text': 'For more details about our Solar system see Chapter '\n", - " '1.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Section-header',\n", - " 'prov': [{'bbox': [133.15542603,\n", - " 542.98168945,\n", - " 167.32983398,\n", - " 554.36669922],\n", - " 'page': 1,\n", - " 'span': [0, 5]}],\n", - " 'text': 'Earth',\n", - " 'type': 'subtitle-level-1'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [132.91053772,\n", - " 512.46295166,\n", - " 477.84887695,\n", - " 534.48431396],\n", - " 'page': 1,\n", - " 'span': [0, 107]}],\n", - " 'text': \"Earth is the third planet from the Sun. It's our home \"\n", - " 'planet. 
Earth is the only place we know of with life.',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Text',\n", - " 'prov': [{'bbox': [133.30151367,\n", - " 494.86206055,\n", - " 240.17156982,\n", - " 505.07229614],\n", - " 'page': 1,\n", - " 'span': [0, 24]}],\n", - " 'text': 'Basic facts about Earth:',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 464.97409058,\n", - " 477.47979736,\n", - " 487.02810669],\n", - " 'page': 1,\n", - " 'span': [0, 79]}],\n", - " 'text': '· Distance from the Sun: Average of 149.6 million '\n", - " 'kilometers (93 million miles)',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 452.86901855,\n", - " 317.90722656,\n", - " 463.24041748],\n", - " 'page': 1,\n", - " 'span': [0, 37]}],\n", - " 'text': '· Rotation Period: 24 hours (one day)',\n", - " 'type': 'paragraph'},\n", - " {'name': 'List-item',\n", - " 'prov': [{'bbox': [145.94500732,\n", - " 440.71496582,\n", - " 396.66357422,\n", - " 451.19915771],\n", - " 'page': 1,\n", - " 'span': [0, 52]}],\n", - " 'text': '· Moons: One moon, called Luna or simply \"the Moon\".',\n", - " 'type': 'paragraph'},\n", - " {'name': 'Page-footer',\n", - " 'prov': [{'bbox': [303.13299561,\n", - " 87.20314026,\n", - " 308.11428833,\n", - " 96.53633118],\n", - " 'page': 1,\n", - " 'span': [0, 1]}],\n", - " 'text': '1',\n", - " 'type': 'page-footer'}],\n", - " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", - " 'page-footers': [],\n", - " 'page-headers': [],\n", - " 'tables': [],\n", - " 'type': 'pdf-document'}\n" - ] - } - ], - "source": [ - "pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))" - ] - }, - { - "cell_type": "markdown", - "id": "72274586", - "metadata": { - "id": "72274586" - }, - "source": [ - "## Step-4: Doc chunks\n", - "\n", - "In the previous step, we extracted text from our PDFs. 
However, the content of each file is still stored as a single row in our parquet output.\n", - "\n", - "In this step, we split the documents into chunks according to their layout segmentation.\n", - "\n", - "This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`\n", - "to chunk according to the document layout segmentation, i.e. respecting the original document components such as paragraphs, tables, enumerations, etc.\n", - "It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: \"application/json\"`,\n", - "which provides the required JSON structure." - ] - }, - { - "cell_type": "markdown", - "id": "96198fa6", - "metadata": { - "id": "96198fa6" - }, - "source": [ - "### 4.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "305f00a3", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "305f00a3", - "outputId": "dd511f34-bab3-4dde-d938-493debb02e5e" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" - ] - } - ], - "source": [ - "STAGE = 2\n", - "\n", - "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_chunk_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "369f2cd1", - "metadata": { - "id": "369f2cd1" - }, - "source": [ - "### 4.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": 
"5b7b18d5", - "outputId": "e0b87171-9d66-473f-e66a-e4b6ae3c3f66" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "22:43:09 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30}\n", - "22:43:09 INFO - pipeline id pipeline_id\n", - "22:43:09 INFO - code location None\n", - "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:43:09 INFO - orchestrator doc_chunk started at 2024-10-16 22:43:09\n", - "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", - "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "22:43:09 INFO - done flushing in 0.0 sec\n", - "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:2 completed successfully\n", - "CPU times: user 1.07 s, sys: 180 ms, total: 1.25 s\n", - "Wall time: 1.55 s\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from 
doc_chunk_transform_python import DocChunkPythonTransformConfiguration\n", - "\n", - "\n", - "# Prepare the commandline params\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # doc_chunk arguments\n", - " # ...\n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(DocChunkPythonTransformConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "213afdf6", - "metadata": { - "id": "213afdf6" - }, - "source": [ - "### 4.3 - Inspect Generated output\n", - "\n", - "We can see that each document has been split into several chunks." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "d8138d43", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 897 - }, - "id": "d8138d43", - "outputId": "fd01e0cb-899e-4c73-d50e-5f4e6f5ff802" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Files processed : 2\n", - "Chunks created : 8\n", - "Input data dimensions (rows x columns)= (2, 12)\n", - "Output data dimensions (rows x columns)= (8, 16)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_id
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...
3mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...
7earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 mars.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "7 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "3 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "7 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "3 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "7 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (f\"Files processed : {input_df.shape[0]:,}\")\n", - "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "9e9ca75c", - "metadata": { - "id": "9e9ca75c" - }, - "source": [ - "### 4.4 - Understanding the Output\n", - "\n", - "Here we see 2 PDF files are split into 6 chunks. Basically we see the documents are being split along 'natural boundaris' - paragraphs and bullet points\n", - "\n", - "See how **document_id** is carried throughout. This helps us identify original documents.\n", - "\n", - "Also note **contents** is now plain text (not JSON as before)" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "3090c950", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 300 - }, - "id": "3090c950", - "outputId": "0f4b6771-8d38-4a27-c756-21f916b23a4f" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontents
0mars.pdfSolar System\\nOur solar system is a vast and f...
1mars.pdfSolar System\\nFor more details about the Solar...
2mars.pdfMars\\nMars, the fourth planet from the Sun, is...
3mars.pdfBasic facts about Mars:\\n· Distance from the S...
4earth.pdfSolar System\\nOur solar system is a vast and f...
5earth.pdfSolar System\\nFor more details about our Solar...
6earth.pdfEarth\\nEarth is the third planet from the Sun....
7earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", - "1 mars.pdf Solar System\\nFor more details about the Solar...\n", - "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", - "4 earth.pdf Solar System\\nOur solar system is a vast and f...\n", - "5 earth.pdf Solar System\\nFor more details about our Solar...\n", - "6 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "7 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "d5f151ae", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "d5f151ae", - "outputId": "a4c491b2-53db-4d71-da24-4479de8d1d65" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about the Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. 
Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 3------\n", - "Basic facts about Mars:\n", - "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 3------\n", - "Earth\n", - "Basic facts about Earth:\n", - "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "· Rotation Period: 24 hours (one day)\n", - "· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } - ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "7ad1c60d", - "metadata": { - "id": "7ad1c60d" - }, - "source": [ - "## Step-5: DOC ID generation of Chunks\n", - "\n", - "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", - "\n", - " - Adding document hash: this enables the addition of a document hash-based id to the data. 
The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column where you want to store it.\n", - " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_id_column** to the name of the column where you want to store it.\n", - "\n", - "**This is a prerequisite for fuzzy dedup** in the pipeline." - ] - }, - { - "cell_type": "markdown", - "id": "1afaa0fd", - "metadata": { - "id": "1afaa0fd" - }, - "source": [ - "### 5.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "6ffd6f54", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "6ffd6f54", - "outputId": "1784c80d-6309-4913-9f55-c018b978968f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" - ] - } - ], - "source": [ - "\n", - "# Input for this stage is the output of the doc chunk component\n", - "# The output of this component makes it possible for the fuzzy dedup component to run on the data.\n", - "\n", - "STAGE = 3\n", - "\n", - "input_folder = output_chunk_dir\n", - "output_folder = output_docid_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "f78a51b7", - "metadata": { - "id": "f78a51b7" - }, - "source": [ - "### 5.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "5fc77557", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "5fc77557", - "outputId": "db2b8670-543e-4073-9c7d-3f9ef5f4317e" - }, - "outputs": [ - { - "name": 
"stderr", - "output_type": "stream", - "text": [ - "22:43:09 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "22:43:09 INFO - pipeline id pipeline_id\n", - "22:43:09 INFO - code location None\n", - "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:43:09 INFO - orchestrator doc_id started at 2024-10-16 22:43:09\n", - "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", - "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "22:43:09 INFO - done flushing in 0.0 sec\n", - "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:3 completed successfully\n", - "CPU times: user 10.1 ms, sys: 3 ms, total: 13.1 ms\n", - "Wall time: 11.3 ms\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. 
Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # orchestrator\n", - " # doc id configuration\n", - " \"doc_id_doc_column\": \"contents\",\n", - " \"doc_id_hash_column\": \"chunk_hash\",\n", - " \"doc_id_int_column\": \"chunk_id\",\n", - "}\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# launch\n", - "\n", - "launcher = PythonTransformLauncher(DocIDPythonTransformRuntimeConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "a9a8c1fa", - "metadata": { - "id": "a9a8c1fa" - }, - "source": [ - "### 5.3 - Inspect Generated output\n", - "\n", - "You will notice we have two extra columns:\n", - "\n", - "- **chunk_hash** (the hash column)\n", - "- **chunk_id** (the integer id column)\n", - "\n", - "But still the same number of rows as before." - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "da9adede", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 860 - }, - "id": "da9adede", - "outputId": "036db4ca-12f6-4b3e-9d7f-fa70e494870d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (8, 16)\n", - "Output data dimensions (rows x columns)= (8, 18)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_id
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.84518433, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...4
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6
3mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2
7earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 mars.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "7 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "3 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "7 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "3 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "7 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", - "\n", - " chunk_hash chunk_id \n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 
2 \n", - "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 " - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53", - "metadata": { - "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53" - }, - "source": [ - "## Step-6: Exact Dedup\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", - "metadata": { - "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe" - }, - "source": [ - "### 6.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "4c7a1b94", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "4c7a1b94", - "outputId": "2f6f05bc-f6fd-4d66-ea01-ed89cd5b80f3" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" - ] - } - ], - "source": [ - "STAGE = 4\n", - "\n", - "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_exact_dedupe_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", - "metadata": { - "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e" - }, - "source": [ - "### 6.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": 
"a624b2b2-faad-4325-ac7d-53a840f564ef", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "outputId": "74dc0b75-58b5-4c97-9965-91315e8a98a5" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "22:43:09 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", - "22:43:09 INFO - pipeline id pipeline_id\n", - "22:43:09 INFO - code location None\n", - "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:43:09 INFO - orchestrator ededup started at 2024-10-16 22:43:09\n", - "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", - "22:43:09 INFO - Starting from the beginning\n", - "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "22:43:09 INFO - done flushing in 0.0 sec\n", - "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:4 completed successfully\n", - "CPU times: user 12.6 ms, sys: 5.26 ms, total: 17.9 ms\n", - "Wall time: 14.6 ms\n" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration\n", - "\n", - "\n", - "# Prepare 
the commandline params\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # ededup parameters\n", - " \"ededup_doc_column\": \"contents\",\n", - " \"ededup_doc_id_column\": \"chunk_hash\",\n", - "}\n", - "\n", - "# Pass the commandline params\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())\n", - "# launch\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "eaf1c3c3", - "metadata": { - "id": "eaf1c3c3" - }, - "source": [ - "### 6.3 - Inspect Generated output" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "d824ebf6", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 815 - }, - "id": "d824ebf6", - "outputId": "68f55770-c750-4607-a205-ba183603019d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (8, 18)\n", - "Output data dimensions (rows x columns)= (7, 19)\n", - "Input chunks before exact dedupe : 8\n", - "Output chunks after exact dedupe : 7\n", - "Duplicate chunks removed : 1\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_idremoved
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcSolar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5[44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcMars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6[]
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-16T22:43:08.0480350.827872mars.pdf07bc0c9a-f863-48e3-9aed-bd289af040bcBasic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7[]
3earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0[]
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1[]
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2[]
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-16T22:43:07.2053500.921915earth.pdfe141f7a4-3e45-4f04-88d3-60e0a81b195bEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3[]
\n", - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 earth.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "3 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "3 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "3 Solar System\\nOur solar system is a vast and f... 
$.main-text[2] \n", - "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", - "\n", - " chunk_hash chunk_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", - "\n", - " removed \n", - "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... 
\n", - "1 [] \n", - "2 [] \n", - "3 [] \n", - "4 [] \n", - "5 [] \n", - "6 [] " - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } + "cells": [ + { + "cell_type": "markdown", + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866", + "metadata": { + "id": "841e533d-ebb3-406d-9da7-b19e2c5f5866" + }, + "source": [ + "# Data Prep Kit Demo 1 - Python version\n", + "\n", + "This notebook will introduce DPK and showcase some of its capabilities.\n", + "\n", + "Here is the workflow\n", + "\n", + "![](https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "b15976e3", + "metadata": { + "id": "b15976e3" + }, + "source": [ + "## How to run this notebook\n", + "\n", + "Two options:\n", + "\n", + "- **Option 1 - Google Colab:** Easiest option; no setup required. Click this badge to open the notebook in Google Colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", + "- **Option 2 - Local Python dev environment:** Set up using this [guide](../../../README.md#-getting-started)\n", + "\n", + "The notebook works the same way in both environments." + ] + }, + { + "cell_type": "markdown", + "id": "eb8b0d5c", + "metadata": { + "id": "eb8b0d5c" + }, + "source": [ + "## Step-1: Inspect the Data\n", + "\n", + "We will use two simple PDFs about the Solar System. 
The files are [here](https://github.com/sujee/data-prep-kit/tree/intro-example1/examples/notebooks/intro/input/solar-system)\n", + "\n", + "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf)\n", + "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf)\n" + ] + }, + { + "cell_type": "markdown", + "id": "39a0ab6e", + "metadata": { + "id": "39a0ab6e" + }, + "source": [ + "## Step-2: Figure out Runtime Environment\n", + "\n", + "### 2.1 - Determine runtime\n", + "\n", + "Determine whether we are running on Google Colab or in a local Python environment." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "1fe354b7", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1fe354b7", + "outputId": "5c153f72-08ed-4d6e-ccc7-dae851e7fd8b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "markdown", + "id": "8e7c104b", + "metadata": { + "id": "8e7c104b" + }, + "source": [ + "### 2.2 - Download Data if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "3309799e", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3309799e", + "outputId": "99530315-6dd5-405d-dbde-61e2332e441b" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + " !mkdir -p 'input/solar-system'\n", + " !wget -O 'input/solar-system/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf'\n", + " !wget -O 'input/solar-system/mars.pdf' 
'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf'\n", + " !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/my_utils.py'" + ] + }, + { + "cell_type": "markdown", + "id": "a5dc2b68", + "metadata": { + "id": "a5dc2b68" + }, + "source": [ + "### 2.3 - Install dependencies if running on Google Colab" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1fcec577", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "1fcec577", + "outputId": "0f77fc39-ffeb-48da-ce6f-1750d8d3ad62" + }, + "outputs": [], + "source": [ + "if RUNNING_IN_COLAB:\n", + " ! pip install --default-timeout=100 \\\n", + " data-prep-toolkit-transforms==0.2.1 \\\n", + " data-prep-toolkit-transforms-ray==0.2.1 \\\n", + " deepsearch-toolkit\n" + ] + }, + { + "cell_type": "markdown", + "id": "243322b8", + "metadata": { + "id": "243322b8" + }, + "source": [ + "### 2.4 - Restart Runtime\n", + "\n", + "After installing dependencies, be sure to restart the runtime so that the newly installed libraries are loaded.\n", + "\n", + "You do this by going to **`Runtime --> Restart Session`**\n", + "\n", + "Then you can continue to the next step (no need to re-run the earlier cells)" + ] + }, + { + "cell_type": "markdown", + "id": "e8b10be1", + "metadata": { + "id": "e8b10be1" + }, + "source": [ + "## Step-2: Configuration" + ] + }, + { + "cell_type": "markdown", + "id": "356c66f7", + "metadata": { + "id": "356c66f7" + }, + "source": [ + "### 2.1 - Basic Config" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "e4YMZrBuFycl", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e4YMZrBuFycl", + "outputId": "d7ee9449-4f21-4c9a-fa54-14b7f28d764a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NOT in Colab\n" + ] + } + ], + "source": [ + 
"import os\n", + "\n", +
"if os.getenv(\"COLAB_RELEASE_TAG\"):\n", + " print(\"Running in Colab\")\n", + " RUNNING_IN_COLAB = True\n", + "else:\n", + " print(\"NOT in Colab\")\n", + " RUNNING_IN_COLAB = False" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "33345487", + "metadata": { + "id": "33345487" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "## Configuration\n", + "class MyConfig:\n", + " pass\n", + "\n", + "MY_CONFIG = MyConfig ()\n", + "\n", + "MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'\n", + "\n", + "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", + "MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", + "\n", + "## Embedding model\n", + "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "b15e6827", + "metadata": { + "id": "b15e6827" + }, + "outputs": [], + "source": [ + "## Add parent dir to path\n", + "import os,sys\n", + "\n", + "this_dir = os.path.abspath('')\n", + "parent_dir = os.path.dirname(this_dir)\n", + "sys.path.append (os.path.abspath (parent_dir))" + ] + }, + { + "cell_type": "markdown", + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63", + "metadata": { + "id": "72510ae6-48b0-4b88-9e13-a623281c3a63" + }, + "source": [ + "### 2.2 - Setup input/outpur directories" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "60ac8bee-0960-4309-b225-d7a211b14262", + "outputId": "4d5511fb-1c6f-47df-e5ea-2c1b354d262f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Cleared output directory\n" + ] + } + ], + "source": [ + "import os, sys\n", + "import shutil\n", + "\n", + "if not os.path.exists(MY_CONFIG.INPUT_DATA_DIR ):\n", + " raise Exception (f\"❌ Input folder MY_CONFIG.INPUT_DATA_DIR = '{MY_CONFIG.INPUT_DATA_DIR}' not found\")\n", + "\n", + 
"output_parquet_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '01_parquet_out')\n", + "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", + "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", + "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", + "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_embeddings_out')\n", + "\n", + "## clear output folder\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", + "shutil.os.makedirs(MY_CONFIG.OUTPUT_FOLDER, exist_ok=True)\n", + "\n", + "print (\"✅ Cleared output directory\")" + ] + }, + { + "cell_type": "markdown", + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb", + "metadata": { + "id": "2449e5c7-078c-4ad6-a2f6-21d39d4da3fb" + }, + "source": [ + "## Step-3: pdf2parquet - Convert data from PDF to Parquet\n", + "\n", + "This step is reading the input folder containing all PDF files and ingest them in a parquet table using the [Docling package](https://github.com/DS4SD/docling).\n", + "The documents are converted into a JSON format which allows to easily chunk it in the later steps.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a", + "metadata": { + "id": "c0c574c4-9dc4-4dab-9ad6-b5338207e67a" + }, + "source": [ + "### 3.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "482605b2-d814-456d-9195-49a2ec454ef0", + "outputId": "c50847d4-f2c7-4559-f5f7-d6a3d025027d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'\n" + ] + } + ], + "source": [ + "STAGE = 1\n", + "\n", + "input_folder = MY_CONFIG.INPUT_DATA_DIR\n", + "output_folder = output_parquet_dir\n", + "\n", + "print 
(f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b", + "metadata": { + "id": "9bb15f02-ab5c-4525-a536-cfa1fd2ba70b" + }, + "source": [ + "### 3.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 657, + "referenced_widgets": [ + "97b603697cfa4b4ea4e6735b6768ca35", + "e87e8d3262c54cfaaa8768505edacda3", + "b78aa40816e44f7fbebcb24ca68818b3", + "7053c9606a414e978636a7e241909504", + "da0787b239764847a731083997780a85", + "553f3c16839a49d79591d0fc4862bed6", + "c0eb5bc8f6ee427ca42204b3c56f9a4e", + "9d184ed175f0403fb03c2e13dfd04e0a", + "724778729161445c98b187031ae4f67c", + "1cb3bbf7d724411cbe9831543a4aecc0", + "06f9b33494984e4885d5aad813d1d2bc" + ] + }, + "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", + "outputId": "01d207fb-983d-40b2-e5f6-e38e3789110a" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:34:39 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "13:34:39 INFO - pipeline id pipeline_id\n", + "13:34:39 INFO - code location None\n", + "13:34:39 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", + "13:34:39 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:34:39 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "13:34:39 INFO - orchestrator pdf2parquet started at 2024-10-18 13:34:39\n", + "13:34:39 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 
0.11101436614990234}\n", + "13:34:39 INFO - Initializing models\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "750f3b6951094b2eb68490c7f5f98148", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Fetching 10 files: 0%| | 0/10 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontentsnum_pagesnum_tablesnum_doc_elementsdocument_idexthashsizedate_acquiredpdf_convert_timesource_filename
0mars.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...10116e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf
1earth.pdf{\"_name\":\"\",\"type\":\"pdf-document\",\"description...1011efbdbcb9-f0af-42f0-b191-2f14ce3ddc7cpdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdf
\n", + "" ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", - "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", - "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "82cc9bb0", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 269 - }, - "id": "82cc9bb0", - "outputId": "46d9e91d-c470-4e3e-e5c8-508c534dbceb" - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
filenamecontents
0mars.pdfSolar System\\nFor more details about the Solar...
1mars.pdfMars\\nMars, the fourth planet from the Sun, is...
2mars.pdfBasic facts about Mars:\\n· Distance from the S...
3earth.pdfSolar System\\nOur solar system is a vast and f...
4earth.pdfSolar System\\nFor more details about our Solar...
5earth.pdfEarth\\nEarth is the third planet from the Sun....
6earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", - "
" - ], - "text/plain": [ - " filename contents\n", - "0 mars.pdf Solar System\\nFor more details about the Solar...\n", - "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "2 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", - "3 earth.pdf Solar System\\nOur solar system is a vast and f...\n", - "4 earth.pdf Solar System\\nFor more details about our Solar...\n", - "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", - "6 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } + "text/plain": [ + " filename contents num_pages \\\n", + "0 mars.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", + "\n", + " num_tables num_doc_elements document_id ext \\\n", + "0 0 11 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 pdf \n", + "1 0 11 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \n", + "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "1 2024-10-18T13:34:43.410297 0.794765 earth.pdf " + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Output dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(5)\n", + "\n", + "## To display certain columns\n", + "#output_df[['column1', 'column2', 'column3']].head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "e5058a21", + "metadata": { + "id": "e5058a21" + }, + "source": [ + "\n", + "### 3.4 - Understand the output\n", + "\n", + "Here are some interesting attributes to note:\n", + "\n", + "- **filename** : original filename\n", + "- **contents** : document content, stored as JSON\n", + "- **document_id**: unique id (UUID) assigned to this document\n", + "- **hash** : hash of document\n", + "- **pdf_convert_time** : time to convert this pdf in seconds\n", + "\n", + "Let's inspect the **contents** column. See how the text is being divided up!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "f870e624", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "f870e624", + "outputId": "0b4c054f-3a8a-4db3-f32f-17bd1466b102" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '1a83f43f3a202e3f203c1263e36961ecc45d401aad488f638fc5559a584333b2',\n", + " 'filename': 'mars.pdf',\n", + " 'page-hashes': [{'hash': '551fe7a9bde2a9302f150c0a79a13fcc0868fcf73ac6afb80be645c1174734a0',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.35137939,\n", + " 654.45184326,\n", + " 169.88169861,\n", + " 667.98492432],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.09541321,\n", + " 630.68127441,\n", + " 210.66503906,\n", + " 642.34405518],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.84518433,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.02520752],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. 
At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.18510437,\n", + " 570.83258057,\n", + " 374.99838257,\n", + " 581.07043457],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about the Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.22866821,\n", + " 542.98168945,\n", + " 163.86282349,\n", + " 554.45288086],\n", + " 'page': 1,\n", + " 'span': [0, 4]}],\n", + " 'text': 'Mars',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87440491,\n", + " 500.84011841,\n", + " 477.48345947,\n", + " 534.55810547],\n", + " 'page': 1,\n", + " 'span': [0, 196]}],\n", + " 'text': 'Mars, the fourth planet from the Sun, is a cold, '\n", + " 'desert world with a thin atmosphere composed '\n", + " 'primarily of carbon dioxide. Its reddish hue comes '\n", + " 'from iron oxide, or rust, prevalent on its surface.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.2026062,\n", + " 482.90710449,\n", + " 237.04431152,\n", + " 493.07443237],\n", + " 'page': 1,\n", + " 'span': [0, 23]}],\n", + " 'text': 'Basic facts about Mars:',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 453.019104,\n", + " 477.48171997,\n", + " 474.9703064],\n", + " 'page': 1,\n", + " 'span': [0, 78]}],\n", + " 'text': '· Distance from the Sun: Average of 228 million '\n", + " 'kilometers (142 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.79351807,\n", + " 431.73287964,\n", + " 451.2142334],\n", + " 'page': 1,\n", + " 'span': [0, 64]}],\n", + " 'text': '· Rotation Period: 24.6 hours (one Martian day - '\n", + " 'called a \"sol\")',\n", + " 'type': 'paragraph'},\n", + " 
{'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 429.10913086,\n", + " 365.9559021,\n", + " 438.83737183],\n", + " 'page': 1,\n", + " 'span': [0, 44]}],\n", + " 'text': '· Moons: Two small moons, Phobos and Deimos.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.51646423],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "import pprint\n", + "import json\n", + "\n", + "pprint.pprint (json.loads(output_df.iloc[0, ]['contents']))\n", + "# json.loads(output_df.iloc[0, ]['contents'])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "e1a10c2d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e1a10c2d", + "outputId": "c1d992c2-faa8-40cd-c375-857970201daa" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'_name': '',\n", + " 'description': {'logs': []},\n", + " 'equations': [],\n", + " 'figures': [],\n", + " 'file-info': {'#-pages': 1,\n", + " 'document-hash': '7401ae81637dbb89e7040dcd5945bbfb75ff8648bb761c69f8a1595e86538748',\n", + " 'filename': 'earth.pdf',\n", + " 'page-hashes': [{'hash': 'ca802e4bd5a3301792808caea2a47db51f0520888875b77fc230c99ee851c19b',\n", + " 'model': 'default',\n", + " 'page': 1}]},\n", + " 'footnotes': [],\n", + " 'main-text': [{'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.30961609,\n", + " 654.45184326,\n", + " 174.04208374,\n", + " 667.93347168],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.12528992,\n", + " 
630.69073486,\n", + " 210.66503906,\n", + " 642.27935791],\n", + " 'page': 1,\n", + " 'span': [0, 12]}],\n", + " 'text': 'Solar System',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.87112427,\n", + " 588.96014404,\n", + " 479.40917969,\n", + " 623.04595947],\n", + " 'page': 1,\n", + " 'span': [0, 205]}],\n", + " 'text': 'Our solar system is a vast and fascinating expanse, '\n", + " 'comprising eight planets, five dwarf planets, '\n", + " 'numerous moons, asteroids, comets, and other '\n", + " 'celestial bodies. At its center lies the star we call '\n", + " 'the Sun.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.20942688,\n", + " 570.81555176,\n", + " 375.57919312,\n", + " 581.08459473],\n", + " 'page': 1,\n", + " 'span': [0, 54]}],\n", + " 'text': 'For more details about our Solar system see Chapter '\n", + " '1.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Section-header',\n", + " 'prov': [{'bbox': [133.15542603,\n", + " 542.98168945,\n", + " 167.32983398,\n", + " 554.36669922],\n", + " 'page': 1,\n", + " 'span': [0, 5]}],\n", + " 'text': 'Earth',\n", + " 'type': 'subtitle-level-1'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [132.91053772,\n", + " 512.46295166,\n", + " 477.84887695,\n", + " 534.48431396],\n", + " 'page': 1,\n", + " 'span': [0, 107]}],\n", + " 'text': \"Earth is the third planet from the Sun. It's our home \"\n", + " 'planet. 
Earth is the only place we know of with life.',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Text',\n", + " 'prov': [{'bbox': [133.30151367,\n", + " 494.86206055,\n", + " 240.17156982,\n", + " 505.07229614],\n", + " 'page': 1,\n", + " 'span': [0, 24]}],\n", + " 'text': 'Basic facts about Earth:',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 464.97409058,\n", + " 477.47979736,\n", + " 487.02810669],\n", + " 'page': 1,\n", + " 'span': [0, 79]}],\n", + " 'text': '· Distance from the Sun: Average of 149.6 million '\n", + " 'kilometers (93 million miles)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 452.86901855,\n", + " 317.90722656,\n", + " 463.24041748],\n", + " 'page': 1,\n", + " 'span': [0, 37]}],\n", + " 'text': '· Rotation Period: 24 hours (one day)',\n", + " 'type': 'paragraph'},\n", + " {'name': 'List-item',\n", + " 'prov': [{'bbox': [145.94500732,\n", + " 440.71496582,\n", + " 396.66357422,\n", + " 451.19915771],\n", + " 'page': 1,\n", + " 'span': [0, 52]}],\n", + " 'text': '· Moons: One moon, called Luna or simply \"the Moon\".',\n", + " 'type': 'paragraph'},\n", + " {'name': 'Page-footer',\n", + " 'prov': [{'bbox': [303.13299561,\n", + " 87.20314026,\n", + " 308.11428833,\n", + " 96.53633118],\n", + " 'page': 1,\n", + " 'span': [0, 1]}],\n", + " 'text': '1',\n", + " 'type': 'page-footer'}],\n", + " 'page-dimensions': [{'height': 792.0, 'page': 1, 'width': 612.0}],\n", + " 'page-footers': [],\n", + " 'page-headers': [],\n", + " 'tables': [],\n", + " 'type': 'pdf-document'}\n" + ] + } + ], + "source": [ + "pprint.pprint (json.loads(output_df.iloc[1, ]['contents']))" + ] + }, + { + "cell_type": "markdown", + "id": "72274586", + "metadata": { + "id": "72274586" + }, + "source": [ + "## Step-4: Doc chunks\n", + "\n", + "In the previous step, we extracted text from our PDFs. 
But the content of each entire file is still stored as 'one row' in our parquet output.\n", + "\n", + "In this step, we are going to split the documents into chunks, according to their layout segmentation.\n", + "\n", + "This transform uses [Quackling](https://github.com/DS4SD/quackling) `HierarchicalChunker`\n", + "to chunk according to the document layout segmentation, i.e. respecting the original document components such as paragraphs, tables, enumerations, etc.\n", + "It relies on documents converted with the Docling library in the [pdf2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md) using the option `contents_type: \"application/json\"`,\n", + "which provides the required JSON structure." + ] + }, + { + "cell_type": "markdown", + "id": "96198fa6", + "metadata": { + "id": "96198fa6" + }, + "source": [ + "### 4.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "305f00a3", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "305f00a3", + "outputId": "dd511f34-bab3-4dde-d938-493debb02e5e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-2: Processing input='output/01_parquet_out' --> output='output/02_chunk_out'\n" + ] + } + ], + "source": [ + "STAGE = 2\n", + "\n", + "input_folder = output_parquet_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_chunk_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "369f2cd1", + "metadata": { + "id": "369f2cd1" + }, + "source": [ + "### 4.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": 
"5b7b18d5", + "outputId": "e0b87171-9d66-473f-e66a-e4b6ae3c3f66" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:34:45 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", + "13:34:45 INFO - pipeline id pipeline_id\n", + "13:34:45 INFO - code location None\n", + "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "13:34:45 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:34:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:34:45 INFO - orchestrator doc_chunk started at 2024-10-18 13:34:45\n", + "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "13:34:45 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "13:34:45 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "13:34:45 INFO - Done processing 2 files, waiting for flush() completion.\n", + "13:34:45 INFO - done flushing in 0.0 sec\n", + "13:34:45 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:2 completed successfully\n", + "CPU times: user 826 ms, sys: 101 ms, total: 928 ms\n", + "Wall time: 923 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from doc_chunk_transform_python import DocChunkPythonTransformConfiguration\n", + 
"\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # doc_chunk arguments\n", + " # ...\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(DocChunkPythonTransformConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "213afdf6", + "metadata": { + "id": "213afdf6" + }, + "source": [ + "### 4.3 - Inspect Generated output\n", + "\n", + "We would see documents are split into many chunks" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "d8138d43", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 897 + }, + "id": "d8138d43", + "outputId": "fd01e0cb-899e-4c73-d50e-5f4e6f5ff802" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Files processed : 2\n", + "Chunks created : 8\n", + "Input data dimensions (rows x columns)= (2, 12)\n", + "Output data dimensions (rows x columns)= (8, 16)\n" + ] + }, + { + "data": { + "text/html": [ + "
" ], - "source": [ - "output_df[['filename', 'contents']]" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "cc61dffa", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "cc61dffa", - "outputId": "7fb26043-8538-48b6-80b7-16ceb818c1a8" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "========== mars.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "For more details about the Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 1------\n", - "Mars\n", - "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", - "-------\n", - "-------Chunk 2------\n", - "Basic facts about Mars:\n", - "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", - "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", - "· Moons: Two small moons, Phobos and Deimos.\n", - "-------\n", - "========== earth.pdf ===========\n", - "-------Chunk 0------\n", - "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", - "-------\n", - "-------Chunk 2------\n", - "Earth\n", - "Earth is the third planet from the Sun. It's our home planet. 
Earth is the only place we know of with life.\n", - "-------\n", - "-------Chunk 3------\n", - "Earth\n", - "Basic facts about Earth:\n", - "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", - "· Rotation Period: 24 hours (one day)\n", - "· Moons: One moon, called Luna or simply \"the Moon\".\n", - "-------\n" - ] - } + "text/plain": [ + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "3 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "7 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "3 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "7 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... 
\n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (f\"Files processed : {input_df.shape[0]:,}\")\n", + "print (f\"Chunks created : {output_df.shape[0]:,}\")\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "9e9ca75c", + "metadata": { + "id": "9e9ca75c" + }, + "source": [ + "### 4.4 - Understanding the Output\n", + "\n", + "Here we see 2 PDF files are split into 8 chunks. The documents are split along 'natural boundaries' - paragraphs and bullet points.\n", + "\n", + "See how the original **document_id** is carried forward as **source_document_id**, while each chunk gets its own **document_id**. This helps us identify the original documents.\n", + "\n", + "Also note **contents** is now plain text (not JSON as before)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "3090c950", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "3090c950", + "outputId": "0f4b6771-8d38-4a27-c756-21f916b23a4f" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" ], - "source": [ - "for f in output_df['filename'].unique():\n", - " print ('==========' , f, '===========')\n", - " chunks = output_df[output_df['filename'] == f]['contents']\n", - " for idx , chunk in enumerate(chunks):\n", - " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" - ] - }, - { - "cell_type": "markdown", - "id": "383f40ba", - "metadata": { - "id": "383f40ba" - }, - "source": [ - "### 6.4 - Understanding the output\n", - "\n", - "Remember we had 8 chunks initially. Now we have 7! One duplicate chunk is removed.\n", - "\n", - "If you look at the PDF, the following common paragraph in `earth.pdf` and `mars.pdf` is removed from one of the documents! Pretty neat, eh!\n", - "\n", - "```text\n", - "## Solar System\n", - "\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "85309751-8556-41c6-ac32-84acc941bc8d", - "metadata": { - "id": "85309751-8556-41c6-ac32-84acc941bc8d" - }, - "source": [ - " ## Step-7: Fuzzy Dedup\n", - "\n", - "And fuzzy dedupe is only available in RAY version. So we will skip it here\n", - "\n", - "See this file [dpk_intro_1_ray.ipynb](dpk_intro_1_ray.ipynb)" - ] - }, - { - "cell_type": "markdown", - "id": "5370950a-2a3a-4143-8218-f9b4808099ba", - "metadata": { - "id": "5370950a-2a3a-4143-8218-f9b4808099ba" - }, - "source": [ - "## Step-8: Text encoding\n", - "\n", - "Encode text for the vector storage." 
- ] - }, - { - "cell_type": "markdown", - "id": "85aba685", - "metadata": { - "id": "85aba685" - }, - "source": [ - "### 8.1 - Set Input/output Folder" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "20a153fa-fd56-401e-86be-4f7617affcc8", - "outputId": "41d268f5-7cc6-432e-d56e-2ba882fbdba6" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "🏃🏼 STAGE-6: Processing input='output/04_exact_dedupe_out' --> output='output/05_embeddings_out'\n" - ] - } + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", + "1 mars.pdf Solar System\\nFor more details about the Solar...\n", + "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "4 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "5 earth.pdf Solar System\\nFor more details about our Solar...\n", + "6 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "7 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "d5f151ae", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d5f151ae", + "outputId": "a4c491b2-53db-4d71-da24-4479de8d1d65" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. 
At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 3------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. 
Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "7ad1c60d", + "metadata": { + "id": "7ad1c60d" + }, + "source": [ + "## Step-5: DOC ID generation of Chunks\n", + "\n", + "This transform annotates documents with document \"ids\". It supports the following transformations of the original data:\n", + "\n", + " - Adding document hash: this enables the addition of a document hash-based id to the data. The hash is calculated with `hashlib.sha256(doc.encode(\"utf-8\")).hexdigest()`. To enable this annotation, set **hash_column** to the name of the column, where you want to store it.\n", + " - Adding integer document id: this allows the addition of an integer document id to the data that is unique across all rows in all tables provided to the transform() method. To enable this annotation, set **int_id_column** to the name of the column, where you want to store it.\n", + "\n", + "**This is a pre-requisite for fuzzy dedup** in the pipeline." 
+ ] + }, + { + "cell_type": "markdown", + "id": "1afaa0fd", + "metadata": { + "id": "1afaa0fd" + }, + "source": [ + "### 5.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "6ffd6f54", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6ffd6f54", + "outputId": "1784c80d-6309-4913-9f55-c018b978968f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-3: Processing input='output/02_chunk_out' --> output='output/03_docid_out'\n" + ] + } + ], + "source": [ + "\n", + "# Input for this stage is the output of the chunking component\n", + "# The output of this component makes it possible for the fdedup component to run on the data.\n", + "\n", + "STAGE = 3\n", + "\n", + "input_folder = output_chunk_dir\n", + "output_folder = output_docid_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "f78a51b7", + "metadata": { + "id": "f78a51b7" + }, + "source": [ + "### 5.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "5fc77557", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "5fc77557", + "outputId": "db2b8670-543e-4073-9c7d-3f9ef5f4317e" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:34:45 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "13:34:45 INFO - pipeline id pipeline_id\n", + "13:34:45 INFO - code location None\n", + "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "13:34:45 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:34:45 INFO - data factory data_ Not using data sets, 
checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:34:45 INFO - orchestrator doc_id started at 2024-10-18 13:34:45\n", + "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", + "13:34:45 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "13:34:45 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "13:34:45 INFO - Done processing 2 files, waiting for flush() completion.\n", + "13:34:45 INFO - done flushing in 0.0 sec\n", + "13:34:45 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:3 completed successfully\n", + "CPU times: user 12.8 ms, sys: 3.7 ms, total: 16.5 ms\n", + "Wall time: 13.1 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from doc_id_transform_python import DocIDPythonTransformRuntimeConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. 
Only required parameters are specified\n", + "    \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + "    # orchestrator\n", + "    # doc id configuration\n", + "    \"doc_id_doc_column\": \"contents\",\n", + "    \"doc_id_hash_column\": \"chunk_hash\",\n", + "    \"doc_id_int_column\": \"chunk_id\",\n", + "}\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# launch\n", + "\n", + "launcher = PythonTransformLauncher(DocIDPythonTransformRuntimeConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + "    print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + "    raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "a9a8c1fa", + "metadata": { + "id": "a9a8c1fa" + }, + "source": [ + "### 5.3 - Inspect Generated output\n", + "\n", + "You will notice we have two extra columns:\n", + "\n", + "- **chunk_hash** (the `hash_column`)\n", + "- **chunk_id** (the `int_column`)\n", + "\n", + "The number of rows is the same as before." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "da9adede", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 860 + }, + "id": "da9adede", + "outputId": "036db4ca-12f6-4b3e-9d7f-fa70e494870d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 16)\n", + "Output data dimensions (rows x columns)= (8, 18)\n" + ] + }, + { + "data": { + "text/html": [ + "
" ], - "source": [ - "STAGE = 6\n", - "\n", - "input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_embeddings_dir\n", - "\n", - "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", - "\n", - "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" - ] - }, - { - "cell_type": "markdown", - "id": "c97545f4", - "metadata": { - "id": "c97545f4" - }, - "source": [ - "### 8.2 - Execute" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "228df6b2-bc62-494b-9697-03ece98d7853", - "outputId": "b2119b07-0654-45cd-f729-1396e18b24b1" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "22:43:10 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "22:43:10 INFO - pipeline id pipeline_id\n", - "22:43:10 INFO - code location None\n", - "22:43:10 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_embeddings_out\n", - "22:43:10 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:43:10 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:43:10 INFO - orchestrator text_encoder started at 2024-10-16 22:43:10\n", - "22:43:10 INFO - Number of files is 2, source profile {'max_file_size': 0.010450363159179688, 'min_file_size': 0.010318756103515625, 'total_file_size': 0.020769119262695312}\n", - "22:43:12 INFO - Completed 1 files (50.0%) in 0.004 min\n", - "22:43:12 INFO - Completed 2 files (100.0%) in 0.004 min\n", - "22:43:12 INFO - Done processing 2 
files, waiting for flush() completion.\n", - "22:43:12 INFO - done flushing in 0.0 sec\n", - "22:43:12 INFO - Completed execution in 0.039 min, execution result 0\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Stage:6 completed successfully\n", - "CPU times: user 671 ms, sys: 230 ms, total: 901 ms\n", - "Wall time: 2.8 s\n" - ] - } + "text/plain": [ + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "3 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "7 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "3 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "7 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "4 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "5 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... 
\n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", + " chunk_hash chunk_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 
3 " + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53", + "metadata": { + "id": "4692975c-49ff-41ae-810e-0f5bc0bbdc53" + }, + "source": [ + "## Step-6: Exact Dedup\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe", + "metadata": { + "id": "5acfd3a2-a236-4143-bcfc-15804f1da7fe" + }, + "source": [ + "### 6.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "4c7a1b94", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4c7a1b94", + "outputId": "2f6f05bc-f6fd-4d66-ea01-ed89cd5b80f3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-4: Processing input='output/03_docid_out' --> output='output/04_exact_dedupe_out'\n" + ] + } + ], + "source": [ + "STAGE = 4\n", + "\n", + "input_folder = output_docid_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_exact_dedupe_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e", + "metadata": { + "id": "3661cb37-39c7-4b09-a784-925bfa9eaf1e" + }, + "source": [ + "### 6.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "metadata": { + "colab": { + "base_uri": 
"https://localhost:8080/" + }, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "outputId": "74dc0b75-58b5-4c97-9965-91315e8a98a5" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:34:45 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", + "13:34:45 INFO - pipeline id pipeline_id\n", + "13:34:45 INFO - code location None\n", + "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "13:34:45 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:34:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:34:45 INFO - orchestrator ededup started at 2024-10-18 13:34:45\n", + "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "13:34:45 INFO - Starting from the beginning\n", + "13:34:45 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "13:34:45 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "13:34:45 INFO - Done processing 2 files, waiting for flush() completion.\n", + "13:34:45 INFO - done flushing in 0.0 sec\n", + "13:34:45 INFO - Completed execution in 0.0 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:4 completed successfully\n", + "CPU times: user 17.6 ms, sys: 997 μs, total: 18.6 ms\n", + "Wall time: 15.2 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from ededup_transform_python import EdedupPythonTransformRuntimeConfiguration\n", + "\n", + "\n", + "# Prepare the commandline params\n", + "local_conf = {\n", + " \"input_folder\": 
input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # ededup parameters\n", + " \"ededup_doc_column\": \"contents\",\n", + " \"ededup_doc_id_column\": \"chunk_hash\",\n", + "}\n", + "\n", + "# Pass the commandline params\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(EdedupPythonTransformRuntimeConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "eaf1c3c3", + "metadata": { + "id": "eaf1c3c3" + }, + "source": [ + "### 6.3 - Inspect Generated output" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d824ebf6", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 815 + }, + "id": "d824ebf6", + "outputId": "68f55770-c750-4607-a205-ba183603019d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (8, 18)\n", + "Output data dimensions (rows x columns)= (7, 19)\n", + "Input chunks before exact dedupe : 8\n", + "Output chunks after exact dedupe : 7\n", + "Duplicate chunks removed : 1\n" + ] + }, + { + "data": { + "text/html": [ + "
" ], - "source": [ - "%%time\n", - "\n", - "from data_processing.runtime.pure_python import PythonTransformLauncher\n", - "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", - "\n", - "local_conf = {\n", - " \"input_folder\": input_folder,\n", - " \"output_folder\": output_folder,\n", - "}\n", - "params = {\n", - " # Data access. Only required parameters are specified\n", - " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # text_encoder\n", - " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", - "}\n", - "\n", - "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "# create launcher\n", - "launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", - "\n", - "return_code = launcher.launch()\n", - "\n", - "if return_code == 0:\n", - " print (f\"✅ Stage:{STAGE} completed successfully\")\n", - "else:\n", - " raise Exception (\"❌ Job failed\")" - ] - }, - { - "cell_type": "markdown", - "id": "b734852c", - "metadata": { - "id": "b734852c" - }, - "source": [ - "### 8.3 - Inspect Generated output\n", - "\n", - "You will see a column called `embeddings` added at the end. This the text content converted into vectors or embeddings. We used the model `sentence-transformers/all-MiniLM-L6-v2`" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "7b1c1d09", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 760 - }, - "id": "7b1c1d09", - "outputId": "018daa18-e5db-4483-d8d5-30aded80d5e3" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Input data dimensions (rows x columns)= (7, 19)\n", - "Output data dimensions (rows x columns)= (7, 20)\n" - ] - }, - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " filename num_pages num_tables num_doc_elements ext \\\n", - "0 mars.pdf 1 0 11 pdf \n", - "1 mars.pdf 1 0 11 pdf \n", - "2 mars.pdf 1 0 11 pdf \n", - "3 earth.pdf 1 0 11 pdf \n", - "4 earth.pdf 1 0 11 pdf \n", - "5 earth.pdf 1 0 11 pdf \n", - "6 earth.pdf 1 0 11 pdf \n", - "\n", - " hash size \\\n", - "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", - "\n", - " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", - "3 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", - "\n", - " source_document_id \\\n", - "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", - "3 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", - "\n", - " contents doc_jsonpath \\\n", - "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "3 Solar System\\nOur solar system is a vast and f... 
$.main-text[2] \n", - "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", - "\n", - " page_number bbox \\\n", - "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", - "\n", - " document_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", - "\n", - " chunk_hash chunk_id \\\n", - "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", - "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", - "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", - "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", - "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", - "\n", - " removed \\\n", - "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", - "1 [] \n", - "2 [] \n", - "3 [] \n", - "4 [] \n", - "5 [] \n", - "6 [] \n", - "\n", - " embeddings \n", - "0 [-0.051861435, 0.0035226212, 0.030617002, 0.04... \n", - "1 [0.07728295, 0.024970993, -0.043180738, 0.0580... 
\n", - "2 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", - "3 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", - "4 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", - "5 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", - "6 [0.091821924, 0.015197902, 0.07716932, 0.01711... " - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - } + "text/plain": [ + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "3 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "3 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 
\n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", + " chunk_hash chunk_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", + "\n", + " removed \n", + "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [] \n", + "5 [] \n", + "6 [] " + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "print (f\"Input chunks before exact dedupe : {input_df.shape[0]:,}\")\n", + "print (f\"Output chunks after exact dedupe : {output_df.shape[0]:,}\")\n", + "print (\"Duplicate chunks removed : \", (input_df.shape[0] - output_df.shape[0]))\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "82cc9bb0", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 269 + }, + "id": "82cc9bb0", + "outputId": "46d9e91d-c470-4e3e-e5c8-508c534dbceb" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamecontents
0mars.pdfSolar System\\nFor more details about the Solar...
1mars.pdfMars\\nMars, the fourth planet from the Sun, is...
2mars.pdfBasic facts about Mars:\\n· Distance from the S...
3earth.pdfSolar System\\nOur solar system is a vast and f...
4earth.pdfSolar System\\nFor more details about our Solar...
5earth.pdfEarth\\nEarth is the third planet from the Sun....
6earth.pdfEarth\\nBasic facts about Earth:\\n· Distance fr...
\n", + "
" ], - "source": [ - "from my_utils import read_parquet_files_as_df\n", - "\n", - "output_df = read_parquet_files_as_df(output_folder)\n", - "\n", - "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", - "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "\n", - "output_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "id": "f5e12630-be6b-4188-a925-77117155617b", - "metadata": { - "id": "f5e12630-be6b-4188-a925-77117155617b" - }, - "source": [ - "## Step-9: Copy output to final output dir" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", - "outputId": "31f09b58-7b2d-48bb-9dac-bc0ba9625c01" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "✅ Copied output from 'output/05_embeddings_out' --> 'output/output_final'\n" - ] - } + "text/plain": [ + " filename contents\n", + "0 mars.pdf Solar System\\nFor more details about the Solar...\n", + "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "2 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "3 earth.pdf Solar System\\nOur solar system is a vast and f...\n", + "4 earth.pdf Solar System\\nFor more details about our Solar...\n", + "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", + "6 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." 
+ ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df[['filename', 'contents']]" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "cc61dffa", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cc61dffa", + "outputId": "7fb26043-8538-48b6-80b7-16ceb818c1a8" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "========== mars.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "For more details about the Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 1------\n", + "Mars\n", + "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", + "-------\n", + "-------Chunk 2------\n", + "Basic facts about Mars:\n", + "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", + "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", + "· Moons: Two small moons, Phobos and Deimos.\n", + "-------\n", + "========== earth.pdf ===========\n", + "-------Chunk 0------\n", + "Solar System\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "-------\n", + "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", + "Earth\n", + "Earth is the third planet from the Sun. It's our home planet. 
Earth is the only place we know of with life.\n", + "-------\n", + "-------Chunk 3------\n", + "Earth\n", + "Basic facts about Earth:\n", + "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", + "· Rotation Period: 24 hours (one day)\n", + "· Moons: One moon, called Luna or simply \"the Moon\".\n", + "-------\n" + ] + } + ], + "source": [ + "for f in output_df['filename'].unique():\n", + " print ('==========' , f, '===========')\n", + " chunks = output_df[output_df['filename'] == f]['contents']\n", + " for idx , chunk in enumerate(chunks):\n", + " print (f'-------Chunk {idx}------\\n{chunk}\\n-------')" + ] + }, + { + "cell_type": "markdown", + "id": "383f40ba", + "metadata": { + "id": "383f40ba" + }, + "source": [ + "### 6.4 - Understanding the output\n", + "\n", + "Remember we had 8 chunks initially. Now we have 7! One duplicate chunk was removed.\n", + "\n", + "If you look at the PDFs, the following paragraph, common to `earth.pdf` and `mars.pdf`, is removed from one of the documents! Pretty neat, eh!\n", + "\n", + "```text\n", + "## Solar System\n", + "\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "85309751-8556-41c6-ac32-84acc941bc8d", + "metadata": { + "id": "85309751-8556-41c6-ac32-84acc941bc8d" + }, + "source": [ + "## Step-7: Fuzzy Dedupe\n", + "\n", + "Fuzzy dedupe is only available in the Ray version, so we will skip it here.\n", + "\n", + "See this file: [dpk_intro_1_ray.ipynb](dpk_intro_1_ray.ipynb)" + ] + }, + { + "cell_type": "markdown", + "id": "5370950a-2a3a-4143-8218-f9b4808099ba", + "metadata": { + "id": "5370950a-2a3a-4143-8218-f9b4808099ba" + }, + "source": [ + "## Step-8: Text encoding\n", + "\n", + "Encode text for the vector storage."
+ ] + }, + { + "cell_type": "markdown", + "id": "85aba685", + "metadata": { + "id": "85aba685" + }, + "source": [ + "### 8.1 - Set Input/output Folder" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "outputId": "41d268f5-7cc6-432e-d56e-2ba882fbdba6" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "🏃🏼 STAGE-6: Processing input='output/04_exact_dedupe_out' --> output='output/05_embeddings_out'\n" + ] + } + ], + "source": [ + "STAGE = 6\n", + "\n", + "input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage\n", + "output_folder = output_embeddings_dir\n", + "\n", + "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", + "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" + ] + }, + { + "cell_type": "markdown", + "id": "c97545f4", + "metadata": { + "id": "c97545f4" + }, + "source": [ + "### 8.2 - Execute" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "outputId": "b2119b07-0654-45cd-f729-1396e18b24b1" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:34:45 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "13:34:45 INFO - pipeline id pipeline_id\n", + "13:34:45 INFO - code location None\n", + "13:34:45 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_embeddings_out\n", + "13:34:45 INFO - data factory data_ 
max_files -1, n_sample -1\n", + "13:34:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:34:45 INFO - orchestrator text_encoder started at 2024-10-18 13:34:45\n", + "13:34:45 INFO - Number of files is 2, source profile {'max_file_size': 0.010450363159179688, 'min_file_size': 0.010318756103515625, 'total_file_size': 0.020769119262695312}\n", + "13:34:47 INFO - Completed 1 files (50.0%) in 0.004 min\n", + "13:34:47 INFO - Completed 2 files (100.0%) in 0.005 min\n", + "13:34:47 INFO - Done processing 2 files, waiting for flush() completion.\n", + "13:34:47 INFO - done flushing in 0.0 sec\n", + "13:34:47 INFO - Completed execution in 0.034 min, execution result 0\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Stage:6 completed successfully\n", + "CPU times: user 615 ms, sys: 146 ms, total: 761 ms\n", + "Wall time: 2.24 s\n" + ] + } + ], + "source": [ + "%%time\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", + "\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. 
Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # text_encoder\n", + " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", + "}\n", + "\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", + "\n", + "return_code = launcher.launch()\n", + "\n", + "if return_code == 0:\n", + " print (f\"✅ Stage:{STAGE} completed successfully\")\n", + "else:\n", + " raise Exception (\"❌ Job failed\")" + ] + }, + { + "cell_type": "markdown", + "id": "b734852c", + "metadata": { + "id": "b734852c" + }, + "source": [ + "### 8.3 - Inspect Generated output\n", + "\n", + "You will see a column called `embeddings` added at the end. This is the text content converted into vectors, or embeddings. We used the model `sentence-transformers/all-MiniLM-L6-v2`." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "7b1c1d09", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 760 + }, + "id": "7b1c1d09", + "outputId": "018daa18-e5db-4483-d8d5-30aded80d5e3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Input data dimensions (rows x columns)= (7, 19)\n", + "Output data dimensions (rows x columns)= (7, 20)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
filenamenum_pagesnum_tablesnum_doc_elementsexthashsizedate_acquiredpdf_convert_timesource_filenamesource_document_idcontentsdoc_jsonpathpage_numberbboxdocument_idchunk_hashchunk_idremovedembeddings
0mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Solar System\\nFor more details about the Solar...$.main-text[3]1[133.18510437, 570.83258057, 374.99838257, 581...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...5[44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...[-0.051861435, 0.0035226212, 0.030617002, 0.04...
1mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Mars\\nMars, the fourth planet from the Sun, is...$.main-text[5]1[132.87440491, 500.84011841, 477.48345947, 534...a31663e06fac41470ecc459f5a58658a3f9997d7801053...a31663e06fac41470ecc459f5a58658a3f9997d7801053...6[][0.07728295, 0.024970993, -0.043180738, 0.0580...
2mars.pdf1011pdf8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...28002024-10-18T13:34:44.2595450.845978mars.pdf6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2Basic facts about Mars:\\n· Distance from the S...$.main-text[6]1[133.2026062, 482.90710449, 237.04431152, 493....7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...7[][0.10598018, 0.025460618, 0.023627337, 0.03905...
3earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nOur solar system is a vast and f...$.main-text[2]1[132.87112427, 588.96014404, 479.40917969, 623...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...0[][0.0077404436, -0.02055944, 0.026426593, 0.011...
4earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cSolar System\\nFor more details about our Solar...$.main-text[3]1[133.20942688, 570.81555176, 375.57919312, 581...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...1[][-0.062105548, -0.0053322907, 0.031277698, 0.0...
5earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nEarth is the third planet from the Sun....$.main-text[5]1[132.91053772, 512.46295166, 477.84887695, 534...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...2[][0.072435796, -0.058001805, -0.019771898, -0.0...
6earth.pdf1011pdf18713f970989055625bef22209b6f4b6830b9ca22046bf...26862024-10-18T13:34:43.4102970.794765earth.pdfefbdbcb9-f0af-42f0-b191-2f14ce3ddc7cEarth\\nBasic facts about Earth:\\n· Distance fr...$.main-text[6]1[133.30151367, 494.86206055, 240.17156982, 505...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...3[][0.091821924, 0.015197902, 0.07716932, 0.01711...
\n", + "
" ], - "source": [ - "import shutil\n", - "\n", - "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", - "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", - "\n", - "print (f\"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" - ] + "text/plain": [ + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "\n", + " hash size \\\n", + "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "\n", + " date_acquired pdf_convert_time source_filename \\\n", + "0 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "1 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "2 2024-10-18T13:34:44.259545 0.845978 mars.pdf \n", + "3 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "4 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "5 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "6 2024-10-18T13:34:43.410297 0.794765 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "1 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "2 6e9fd08a-a4e2-47da-b5a9-bb1e1a3ab6e2 \n", + "3 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "4 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "5 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "6 efbdbcb9-f0af-42f0-b191-2f14ce3ddc7c \n", + "\n", + " contents doc_jsonpath \\\n", + "0 Solar System\\nFor more details about the Solar... 
$.main-text[3] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", + " chunk_hash chunk_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", + "\n", + " removed \\\n", + "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... 
\n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [] \n", + "5 [] \n", + "6 [] \n", + "\n", + " embeddings \n", + "0 [-0.051861435, 0.0035226212, 0.030617002, 0.04... \n", + "1 [0.07728295, 0.024970993, -0.043180738, 0.0580... \n", + "2 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", + "3 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", + "4 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", + "5 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", + "6 [0.091821924, 0.015197902, 0.07716932, 0.01711... " + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" } - ], - "metadata": { + ], + "source": [ + "from my_utils import read_parquet_files_as_df\n", + "\n", + "output_df = read_parquet_files_as_df(output_folder)\n", + "\n", + "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", + "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", + "\n", + "output_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "f5e12630-be6b-4188-a925-77117155617b", + "metadata": { + "id": "f5e12630-be6b-4188-a925-77117155617b" + }, + "source": [ + "## Step-9: Copy output to final output dir" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "metadata": { "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "dpk-1-basic-022dev1-py312", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.7" - }, - "widgets": { - "application/vnd.jupyter.widget-state+json": { - "06f9b33494984e4885d5aad813d1d2bc": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": 
"@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "1cb3bbf7d724411cbe9831543a4aecc0": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "553f3c16839a49d79591d0fc4862bed6": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - 
"border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "7053c9606a414e978636a7e241909504": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_1cb3bbf7d724411cbe9831543a4aecc0", - "placeholder": "​", - "style": "IPY_MODEL_06f9b33494984e4885d5aad813d1d2bc", - "value": " 10/10 [00:00<00:00, 349.38it/s]" - } - }, - "724778729161445c98b187031ae4f67c": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "97b603697cfa4b4ea4e6735b6768ca35": { - "model_module": 
"@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_e87e8d3262c54cfaaa8768505edacda3", - "IPY_MODEL_b78aa40816e44f7fbebcb24ca68818b3", - "IPY_MODEL_7053c9606a414e978636a7e241909504" - ], - "layout": "IPY_MODEL_da0787b239764847a731083997780a85" - } - }, - "9d184ed175f0403fb03c2e13dfd04e0a": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "b78aa40816e44f7fbebcb24ca68818b3": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - 
"model_name": "FloatProgressModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_9d184ed175f0403fb03c2e13dfd04e0a", - "max": 10, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_724778729161445c98b187031ae4f67c", - "value": 10 - } - }, - "c0eb5bc8f6ee427ca42204b3c56f9a4e": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "da0787b239764847a731083997780a85": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, 
- "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "e87e8d3262c54cfaaa8768505edacda3": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_553f3c16839a49d79591d0fc4862bed6", - "placeholder": "​", - "style": "IPY_MODEL_c0eb5bc8f6ee427ca42204b3c56f9a4e", - "value": "Fetching 10 files: 100%" - } - } - } + "base_uri": "https://localhost:8080/" + }, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "outputId": "31f09b58-7b2d-48bb-9dac-bc0ba9625c01" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Copied output from 'output/05_embeddings_out' --> 'output/output_final'\n" + ] } + ], + "source": [ + "import shutil\n", + "\n", + "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER_FINAL, ignore_errors=True)\n", + "shutil.copytree(src=output_folder, dst=MY_CONFIG.OUTPUT_FOLDER_FINAL)\n", + "\n", + "print (f\"✅ Copied output from '{output_folder}' --> '{MY_CONFIG.OUTPUT_FOLDER_FINAL}'\")" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "dpk-2-basic-021-py311", + "language": "python", + "name": "python3" }, - "nbformat": 4, - "nbformat_minor": 5 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": 
"python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "06f9b33494984e4885d5aad813d1d2bc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "1cb3bbf7d724411cbe9831543a4aecc0": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "553f3c16839a49d79591d0fc4862bed6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + 
"state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7053c9606a414e978636a7e241909504": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_1cb3bbf7d724411cbe9831543a4aecc0", + "placeholder": "​", + "style": "IPY_MODEL_06f9b33494984e4885d5aad813d1d2bc", + "value": " 10/10 [00:00<00:00, 349.38it/s]" + } + }, + "724778729161445c98b187031ae4f67c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + 
"_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "97b603697cfa4b4ea4e6735b6768ca35": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_e87e8d3262c54cfaaa8768505edacda3", + "IPY_MODEL_b78aa40816e44f7fbebcb24ca68818b3", + "IPY_MODEL_7053c9606a414e978636a7e241909504" + ], + "layout": "IPY_MODEL_da0787b239764847a731083997780a85" + } + }, + "9d184ed175f0403fb03c2e13dfd04e0a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": 
null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "b78aa40816e44f7fbebcb24ca68818b3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9d184ed175f0403fb03c2e13dfd04e0a", + "max": 10, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_724778729161445c98b187031ae4f67c", + "value": 10 + } + }, + "c0eb5bc8f6ee427ca42204b3c56f9a4e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "da0787b239764847a731083997780a85": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + 
"grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e87e8d3262c54cfaaa8768505edacda3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_553f3c16839a49d79591d0fc4862bed6", + "placeholder": "​", + "style": "IPY_MODEL_c0eb5bc8f6ee427ca42204b3c56f9a4e", + "value": "Fetching 10 files: 100%" + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index b39e30d2d..04af8ecd9 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -68,7 +68,11 @@ "execution_count": 1, "id": "1fe354b7", "metadata": { - "id": "1fe354b7" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1fe354b7", + "outputId": "6665c654-baa5-46dc-d370-9931e0e9eed3" }, "outputs": [ { @@ -105,7 +109,11 @@ "execution_count": 2, "id": "3309799e", "metadata": { - "id": "3309799e" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": 
"3309799e", + "outputId": "00d7362e-d675-4aaf-8c87-d99027d9a06c" }, "outputs": [], "source": [ @@ -131,14 +139,19 @@ "execution_count": 3, "id": "1fcec577", "metadata": { - "id": "1fcec577" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "1fcec577", + "outputId": "48cf233b-f04e-4b9b-9605-423f87693f10" }, "outputs": [], "source": [ "if RUNNING_IN_COLAB:\n", " ! pip install --default-timeout=100 \\\n", - " data-prep-toolkit[ray]==0.2.2.dev1 \\\n", - " data-prep-toolkit-transforms[ray,all]==0.2.2.dev1 \\\n", + " data-prep-toolkit-transforms==0.2.1 \\\n", + " data-prep-toolkit-transforms-ray==0.2.1 \\\n", " deepsearch-toolkit" ] }, @@ -187,7 +200,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e4YMZrBuFycl", - "outputId": "54e232da-b2a8-4f3e-d983-94259505dad3" + "outputId": "1a1d5f01-0856-40b6-8b1c-8187b0c38d64" }, "outputs": [ { @@ -218,7 +231,7 @@ "base_uri": "https://localhost:8080/" }, "id": "33345487", - "outputId": "c14c3a3d-c074-4535-b75d-19c5effa7d94" + "outputId": "f3e71a25-4864-4f8f-dfce-4af3d7e08a8a" }, "outputs": [ { @@ -226,7 +239,7 @@ "output_type": "stream", "text": [ "MY_CONFIG.RAY_RUNTIME_WORKERS: 2\n", - "MY_CONFIG.RAY_NUM_CPUS: 1\n", + "MY_CONFIG.RAY_NUM_CPUS: 0.8\n", "MY_CONFIG.RAY_MEMORY_GB: 2\n" ] } @@ -259,10 +272,11 @@ "else: # local run\n", " num_cpus_available = os.cpu_count()\n", " # print (num_cpus_available)\n", - " MY_CONFIG.RAY_NUM_CPUS = 1\n", + "\n", + " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", + " MY_CONFIG.RAY_NUM_CPUS = 0.8\n", " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", " # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", - " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", "\n", "print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)\n", "print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)\n", @@ -305,7 +319,7 @@ "base_uri": "https://localhost:8080/" }, "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "outputId": "fd42f265-445f-488c-8c62-b293424f162d" + "outputId": 
"ec5beb05-027a-49eb-9a96-271471619d81" }, "outputs": [ { @@ -370,7 +384,7 @@ "base_uri": "https://localhost:8080/" }, "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "outputId": "f4c02b6f-effd-4d04-8547-f270f721f8d2" + "outputId": "f8383739-a4fb-450c-dc37-5df32aab8212" }, "outputs": [ { @@ -409,38 +423,38 @@ "base_uri": "https://localhost:8080/" }, "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "outputId": "2cb0721a-1526-4129-a72f-77c1beefafdb" + "outputId": "14a36e73-a186-4431-a755-f46ccb691130" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:45:46 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "22:45:46 INFO - pipeline id pipeline_id\n", - "22:45:46 INFO - code location None\n", - "22:45:46 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}\n", - "22:45:46 INFO - actor creation delay 0\n", - "22:45:46 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", - "22:45:46 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", - "22:45:46 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:45:46 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "22:45:46 INFO - Running locally\n", - "2024-10-16 22:45:48,783\tINFO worker.py:1777 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - orchestrator started at 2024-10-16 22:45:52\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.14609298761934, 'object_store': 3.073046493344009}\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:45:52 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each\n", - "\u001b[36m(RayTransformFileProcessor pid=1001895)\u001b[0m 22:45:55 INFO - Initializing models\n", - "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 103563.06it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=1001895)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:00 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:02 INFO - Completed processing 2 files in 0.033 min\n", - "\u001b[36m(orchestrate pid=1000934)\u001b[0m 22:46:02 INFO - done flushing in 0.001 sec\n", - "\u001b[36m(RayTransformFileProcessor pid=1001896)\u001b[0m 22:45:55 INFO - Initializing models\n", - "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 126716.13it/s]\n", - "\u001b[36m(RayTransformFileProcessor pid=1001896)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n", - "22:46:12 INFO - Completed execution in 0.43 min, execution result 0\n" + "13:30:44 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "13:30:44 INFO - pipeline id pipeline_id\n", + "13:30:44 INFO - code location None\n", + "13:30:44 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'memory': 2147483648, 'max_restarts': -1}\n", + "13:30:44 INFO - actor creation delay 0\n", + "13:30:44 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:30:44 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", + "13:30:44 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:30:44 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "13:30:44 INFO - Running locally\n", + "2024-10-18 13:30:47,436\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - orchestrator started at 2024-10-18 13:30:50\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 14.872821807861328, 'object_store': 7.436410903930664}\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'memory': 2147483648, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:50 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(RayTransformFileProcessor pid=10098)\u001b[0m 13:30:53 INFO - Initializing models\n", + "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 110376.42it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=10098)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:59 INFO - Completed processing 2 files in 0.145 min\n", + "\u001b[36m(orchestrate pid=9266)\u001b[0m 13:30:59 INFO - done flushing in 0.001 sec\n", + "\u001b[36m(RayTransformFileProcessor pid=10099)\u001b[0m 13:30:53 INFO - Initializing models\n", + "Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 73713.60it/s]\n", + "\u001b[36m(RayTransformFileProcessor pid=10099)\u001b[0m Neither CUDA nor MPS are available - defaulting to CPU. 
Note: This module is much faster with a GPU.\n", + "13:31:09 INFO - Completed execution in 0.421 min, execution result 0\n" ] }, { @@ -448,8 +462,8 @@ "output_type": "stream", "text": [ "✅ Stage:1 completed successfully\n", - "CPU times: user 4.46 s, sys: 1.22 s, total: 5.69 s\n", - "Wall time: 30.4 s\n" + "CPU times: user 4.41 s, sys: 1.39 s, total: 5.8 s\n", + "Wall time: 31.1 s\n" ] } ], @@ -528,7 +542,7 @@ "height": 255 }, "id": "fe59563d", - "outputId": "40c31bad-d00a-4da9-8169-9db1bcc47704" + "outputId": "d10c022d-524f-4a13-ebf8-6431114e9172" }, "outputs": [ { @@ -581,12 +595,12 @@ " 1\n", " 0\n", " 11\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", " \n", " \n", @@ -596,12 +610,12 @@ " 1\n", " 0\n", " 11\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", " \n", " \n", @@ -614,16 +628,16 @@ "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", "\n", " num_tables num_doc_elements document_id ext \\\n", - "0 0 11 f20aa513-8473-4bf7-a746-a66eb28b722c pdf \n", - "1 0 11 b4c44875-3612-4c5a-b387-2f04c63d1276 pdf \n", + "0 0 11 62e5639f-f922-4ccc-a041-3cb02f1cfd83 pdf \n", + "1 0 11 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.131556 2.001925 earth.pdf " + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.494027 2.015123 earth.pdf " ] }, "execution_count": 10, @@ -674,7 +688,7 @@ "base_uri": "https://localhost:8080/" }, "id": "f870e624", - "outputId": "fd259342-158a-4a33-f148-d8462e2f1ca2" + "outputId": "9142246b-988c-4674-99d7-e2f3fffbaaf4" }, "outputs": [ { @@ -826,7 +840,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e1a10c2d", - "outputId": "68cdc0c0-3bf5-45a2-d2bc-99aa79e3e0d5" + "outputId": "ca74113e-6fd3-488b-836a-60bd58299fb1" }, "outputs": [ { @@ -1000,7 +1014,7 @@ "base_uri": "https://localhost:8080/" }, "id": "305f00a3", - "outputId": "7a800f4b-bc80-452d-c3d6-170e19f3422e" + "outputId": "689f1531-7007-49d9-9a27-39c39f8f2c50" }, "outputs": [ { @@ -1041,32 +1055,32 @@ "base_uri": "https://localhost:8080/" }, "id": "5b7b18d5", - "outputId": "e6f06879-906c-47d0-ef34-b018e4efa00f" + "outputId": "0146bd91-2ccb-4e56-c649-f415a38bfcf8" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:46:15 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30}\n", - "22:46:15 INFO - pipeline id pipeline_id\n", - "22:46:15 INFO - code location None\n", - "22:46:15 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "22:46:15 INFO - actor creation delay 0\n", - "22:46:15 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", - 
"22:46:15 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "22:46:15 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:46:15 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:46:15 INFO - Running locally\n", - "2024-10-16 22:46:16,484\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - orchestrator started at 2024-10-16 22:46:19\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.136235047131777, 'object_store': 3.068117522634566}\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:19 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - Completed processing 2 files in 0.0 min\n", - "\u001b[36m(orchestrate pid=1002677)\u001b[0m 22:46:21 INFO - done flushing in 0.001 sec\n", - "22:46:31 INFO - Completed execution in 0.271 min, execution result 0\n" + "13:31:12 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", + "13:31:12 INFO - pipeline id pipeline_id\n", + "13:31:12 INFO - code location None\n", + "13:31:12 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "13:31:12 INFO - actor creation delay 0\n", + "13:31:12 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_chunk', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:31:12 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "13:31:12 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:31:12 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:31:12 INFO - Running locally\n", + "2024-10-18 13:31:14,121\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - orchestrator started at 2024-10-18 13:31:16\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 14.963891602121294, 'object_store': 7.4819458005949855}\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:16 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:18 INFO - Completed processing 2 files in 0.032 min\n", + "\u001b[36m(orchestrate pid=10912)\u001b[0m 13:31:18 INFO - done flushing in 0.001 sec\n", + "13:31:28 INFO - Completed execution in 0.269 min, execution result 0\n" ] }, { @@ -1074,8 +1088,8 @@ "output_type": "stream", "text": [ "✅ Stage:2 completed successfully\n", - "CPU times: user 1.04 s, sys: 360 ms, total: 1.4 s\n", - "Wall time: 19.1 s\n" + "CPU times: user 982 ms, sys: 291 ms, total: 1.27 s\n", + "Wall time: 18.9 s\n" ] } ], @@ -1140,7 +1154,7 @@ "height": 897 }, "id": "d8138d43", - "outputId": "3e040b55-8c94-4f97-fedf-d2dbead55a72" + "outputId": "e1758b0c-5f22-4368-c3e6-ff778fc9ae82" }, "outputs": [ { @@ -1202,10 +1216,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -1221,10 
+1235,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", @@ -1240,10 +1254,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", @@ -1259,10 +1273,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", @@ -1278,10 +1292,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -1297,10 +1311,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", @@ -1316,10 +1330,10 @@ " 
pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", @@ -1335,10 +1349,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", @@ -1371,24 +1385,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "7 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "3 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "6 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "7 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", "\n", " source_document_id \\\n", - "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "3 
f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "7 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "3 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "6 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "7 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -1466,7 +1480,7 @@ "height": 300 }, "id": "3090c950", - "outputId": "4c3b6461-ae8c-41d9-8c71-e1bbe634b9ed" + "outputId": "3f542446-2cfa-404c-c642-3732f7b74568" }, "outputs": [ { @@ -1569,7 +1583,7 @@ "base_uri": "https://localhost:8080/" }, "id": "d5f151ae", - "outputId": "3dc3ec5d-31d7-4081-db16-8bb6051ea80a" + "outputId": "4616d648-0852-4ecb-cef8-f5940e176de0" }, "outputs": [ { @@ -1662,7 +1676,7 @@ "base_uri": "https://localhost:8080/" }, "id": "1f747c0d", - "outputId": "765daa01-138b-4bfa-a75c-bffc80f9e246" + "outputId": "e42500b7-5d1e-41fd-b53b-34d3393f36f4" }, "outputs": [ { @@ -1704,36 +1718,35 @@ "id": "f6e9e145", "metadata": { "colab": { - "base_uri": "https://localhost:8080/", - "height": 883 + "base_uri": "https://localhost:8080/" }, "id": "f6e9e145", - "outputId": "fe3d0a3d-0575-4dd8-8564-e336a6ddb68d" + "outputId": "2add5f0c-3ab6-4336-8a7b-ac8b1b76ab73" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:46:32 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "22:46:32 INFO - pipeline id pipeline_id\n", - "22:46:32 INFO - code location None\n", - "22:46:32 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - 
"22:46:32 INFO - actor creation delay 0\n", - "22:46:32 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", - "22:46:32 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", - "22:46:32 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:46:32 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:46:32 INFO - Running locally\n", - "2024-10-16 22:46:33,897\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - orchestrator started at 2024-10-16 22:46:35\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.126107025891542, 'object_store': 3.0630535120144486}\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:35 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - Completed processing 2 files in 0.003 min\n", - "\u001b[36m(orchestrate pid=1004253)\u001b[0m 22:46:36 INFO - done flushing in 0.001 sec\n", - "22:46:46 INFO - Completed execution in 0.227 min, execution result 0\n" + "13:31:29 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "13:31:29 INFO - pipeline id pipeline_id\n", + "13:31:29 INFO - code location None\n", + "13:31:29 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "13:31:29 INFO - actor creation delay 0\n", + "13:31:29 INFO - job details {'job category': 'preprocessing', 'job name': 'doc_id', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:31:29 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "13:31:29 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:31:29 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:31:29 INFO - Running locally\n", + "2024-10-18 13:31:31,792\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - orchestrator started at 2024-10-18 13:31:32\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 15.033103181049228, 'object_store': 7.516551589593291}\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:32 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:33 INFO - Completed processing 2 files in 0.012 min\n", + "\u001b[36m(orchestrate pid=12291)\u001b[0m 13:31:33 INFO - done flushing in 0.001 sec\n", + "13:31:43 INFO - Completed execution in 0.228 min, execution result 0\n" ] }, { @@ -1741,8 +1754,8 @@ "output_type": "stream", "text": [ "✅ Stage:3 completed successfully\n", - "CPU times: user 122 ms, sys: 153 ms, total: 276 ms\n", - "Wall time: 14.9 s\n" + "CPU times: user 123 ms, sys: 145 ms, total: 267 ms\n", + "Wall time: 15.2 s\n" ] } ], @@ -1808,10 +1821,10 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 373 + "height": 860 }, "id": "1911179a", - "outputId": "b82445e8-ebba-48fa-b1c2-26a9e0743ef9" + "outputId": "45e83e2a-1f70-46b9-e311-c50f025419be" }, "outputs": [ { @@ -1873,10 +1886,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar 
System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -1894,10 +1907,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", @@ -1915,10 +1928,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", @@ -1936,10 +1949,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", @@ -1957,10 +1970,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -1978,10 +1991,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nFor more 
details about our Solar...\n", " $.main-text[3]\n", " 1\n", @@ -1999,10 +2012,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", @@ -2020,10 +2033,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", @@ -2058,24 +2071,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "7 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "3 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "6 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "7 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", "\n", " source_document_id \\\n", - "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "1 
f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "3 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "7 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "3 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "6 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "7 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -2160,7 +2173,11 @@ "execution_count": 21, "id": "4c7a1b94", "metadata": { - "id": "4c7a1b94" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4c7a1b94", + "outputId": "40a119b4-44fc-483d-9ad0-da178a2a8eb1" }, "outputs": [ { @@ -2197,32 +2214,36 @@ "execution_count": 22, "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", "metadata": { - "id": "a624b2b2-faad-4325-ac7d-53a840f564ef" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", + "outputId": "bd0f3f94-8c48-4c6b-b911-858e389243f4" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:46:47 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", - "22:46:47 INFO - pipeline id pipeline_id\n", - "22:46:47 INFO - code location None\n", - "22:46:47 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "22:46:47 INFO - actor creation delay 0\n", - "22:46:47 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", - 
"22:46:47 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "22:46:47 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:46:47 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:46:47 INFO - Running locally\n", - "2024-10-16 22:46:48,851\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - orchestrator started at 2024-10-16 22:46:50\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.11034622322768, 'object_store': 3.055173110216856}\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:50 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - Completed processing 2 files in 0.003 min\n", - "\u001b[36m(orchestrate pid=1005823)\u001b[0m 22:46:51 INFO - done flushing in 0.001 sec\n", - "22:47:01 INFO - Completed execution in 0.226 min, execution result 0\n" + "13:31:45 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None, 'hash_cpu': 0.5, 'num_hashes': 2}\n", + "13:31:45 INFO - pipeline id pipeline_id\n", + "13:31:45 INFO - code location None\n", + "13:31:45 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "13:31:45 INFO - actor creation delay 0\n", + "13:31:45 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:31:45 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "13:31:45 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:31:45 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:31:45 INFO - Running locally\n", + "2024-10-18 13:31:47,001\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - orchestrator started at 2024-10-18 13:31:48\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 15.010423279367387, 'object_store': 7.505211639218032}\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - Completed processing 2 files in 0.013 min\n", + "\u001b[36m(orchestrate pid=13775)\u001b[0m 13:31:48 INFO - done flushing in 0.001 sec\n", + "13:31:58 INFO - Completed execution in 0.228 min, execution result 0\n" ] }, { @@ -2230,8 +2251,8 @@ "output_type": "stream", "text": [ "✅ Stage:4 completed successfully\n", - "CPU times: user 125 ms, sys: 134 ms, total: 259 ms\n", - "Wall time: 15 s\n" + "CPU times: user 136 ms, sys: 154 ms, total: 289 ms\n", + "Wall time: 15.2 s\n" ] } ], @@ -2292,7 +2313,12 @@ "execution_count": 23, "id": "d824ebf6", "metadata": { - "id": "d824ebf6" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 815 + }, + "id": "d824ebf6", + "outputId": "9173efb6-1b95-4a7e-b531-1a611841a4d0" }, "outputs": [ { @@ -2358,32 +2384,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", - " Solar System\\nOur solar system is a vast and f...\n", - " 
$.main-text[2]\n", - " 1\n", - " [132.84518433, 588.96014404, 479.40917969, 623...\n", - " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", - " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", - " 4\n", - " []\n", - " \n", - " \n", - " 1\n", - " mars.pdf\n", - " 1\n", - " 0\n", - " 11\n", - " pdf\n", - " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", - " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", - " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", @@ -2391,10 +2395,10 @@ " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " 5\n", - " []\n", + " [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...\n", " \n", " \n", - " 2\n", + " 1\n", " mars.pdf\n", " 1\n", " 0\n", @@ -2402,10 +2406,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", @@ -2416,7 +2420,7 @@ " []\n", " \n", " \n", - " 3\n", + " 2\n", " mars.pdf\n", " 1\n", " 0\n", @@ -2424,10 +2428,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", @@ -2438,6 +2442,28 @@ " []\n", " \n", " \n", + " 3\n", + " earth.pdf\n", + " 1\n", + " 0\n", + " 11\n", + " pdf\n", + " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", + " 2686\n", + " 
2024-10-18T13:30:59.494027\n", + " 2.015123\n", + " earth.pdf\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", + " Solar System\\nOur solar system is a vast and f...\n", + " $.main-text[2]\n", + " 1\n", + " [132.87112427, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 0\n", + " []\n", + " \n", + " \n", " 4\n", " earth.pdf\n", " 1\n", @@ -2446,10 +2472,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", @@ -2457,7 +2483,7 @@ " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " 1\n", - " [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...\n", + " []\n", " \n", " \n", " 5\n", @@ -2468,10 +2494,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", @@ -2490,10 +2516,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", @@ -2512,7 +2538,7 @@ "0 mars.pdf 1 0 11 pdf \n", "1 mars.pdf 1 0 11 pdf \n", "2 mars.pdf 1 0 11 pdf \n", - "3 mars.pdf 1 0 11 
pdf \n", + "3 earth.pdf 1 0 11 pdf \n", "4 earth.pdf 1 0 11 pdf \n", "5 earth.pdf 1 0 11 pdf \n", "6 earth.pdf 1 0 11 pdf \n", @@ -2521,71 +2547,71 @@ "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "3 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", + "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "3 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "6 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "3 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "6 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", "\n", " source_document_id \\\n", - "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "3 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "6 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 
\n", + "3 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "6 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", "\n", " contents doc_jsonpath \\\n", - "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "1 Solar System\\nFor more details about the Solar... $.main-text[3] \n", - "2 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "3 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", + "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", "\n", " page_number bbox \\\n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", " document_id \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 
\n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", "\n", " chunk_hash chunk_id \\\n", - "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", - "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", - "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", - "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", "\n", " removed \n", - "0 [] \n", + "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", "1 [] \n", "2 [] \n", "3 [] \n", - "4 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... 
\n", + "4 [] \n", "5 [] \n", "6 [] " ] @@ -2614,7 +2640,12 @@ "execution_count": 24, "id": "82cc9bb0", "metadata": { - "id": "82cc9bb0" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 269 + }, + "id": "82cc9bb0", + "outputId": "e043fa01-ceca-49ae-b764-8154219c7b6c" }, "outputs": [ { @@ -2646,22 +2677,22 @@ " \n", " 0\n", " mars.pdf\n", - " Solar System\\nOur solar system is a vast and f...\n", + " Solar System\\nFor more details about the Solar...\n", " \n", " \n", " 1\n", " mars.pdf\n", - " Solar System\\nFor more details about the Solar...\n", + " Mars\\nMars, the fourth planet from the Sun, is...\n", " \n", " \n", " 2\n", " mars.pdf\n", - " Mars\\nMars, the fourth planet from the Sun, is...\n", + " Basic facts about Mars:\\n· Distance from the S...\n", " \n", " \n", " 3\n", - " mars.pdf\n", - " Basic facts about Mars:\\n· Distance from the S...\n", + " earth.pdf\n", + " Solar System\\nOur solar system is a vast and f...\n", " \n", " \n", " 4\n", @@ -2684,10 +2715,10 @@ ], "text/plain": [ " filename contents\n", - "0 mars.pdf Solar System\\nOur solar system is a vast and f...\n", - "1 mars.pdf Solar System\\nFor more details about the Solar...\n", - "2 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", - "3 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "0 mars.pdf Solar System\\nFor more details about the Solar...\n", + "1 mars.pdf Mars\\nMars, the fourth planet from the Sun, is...\n", + "2 mars.pdf Basic facts about Mars:\\n· Distance from the S...\n", + "3 earth.pdf Solar System\\nOur solar system is a vast and f...\n", "4 earth.pdf Solar System\\nFor more details about our Solar...\n", "5 earth.pdf Earth\\nEarth is the third planet from the Sun....\n", "6 earth.pdf Earth\\nBasic facts about Earth:\\n· Distance fr..." 
@@ -2707,7 +2738,11 @@ "execution_count": 25, "id": "cc61dffa", "metadata": { - "id": "cc61dffa" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cc61dffa", + "outputId": "aff7a0d9-a791-42a5-d5b7-ad643f59f261" }, "outputs": [ { @@ -2717,17 +2752,13 @@ "========== mars.pdf ===========\n", "-------Chunk 0------\n", "Solar System\n", - "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", - "-------\n", - "-------Chunk 1------\n", - "Solar System\n", "For more details about the Solar system see Chapter 1.\n", "-------\n", - "-------Chunk 2------\n", + "-------Chunk 1------\n", "Mars\n", "Mars, the fourth planet from the Sun, is a cold, desert world with a thin atmosphere composed primarily of carbon dioxide. Its reddish hue comes from iron oxide, or rust, prevalent on its surface.\n", "-------\n", - "-------Chunk 3------\n", + "-------Chunk 2------\n", "Basic facts about Mars:\n", "· Distance from the Sun: Average of 228 million kilometers (142 million miles)\n", "· Rotation Period: 24.6 hours (one Martian day - called a \"sol\")\n", @@ -2736,13 +2767,17 @@ "========== earth.pdf ===========\n", "-------Chunk 0------\n", "Solar System\n", - "For more details about our Solar system see Chapter 1.\n", + "Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.\n", "-------\n", "-------Chunk 1------\n", + "Solar System\n", + "For more details about our Solar system see Chapter 1.\n", + "-------\n", + "-------Chunk 2------\n", "Earth\n", "Earth is the third planet from the Sun. It's our home planet. 
Earth is the only place we know of with life.\n", "-------\n", - "-------Chunk 2------\n", + "-------Chunk 3------\n", "Earth\n", "Basic facts about Earth:\n", "· Distance from the Sun: Average of 149.6 million kilometers (93 million miles)\n", @@ -2810,7 +2845,11 @@ "execution_count": 26, "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", "metadata": { - "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "outputId": "d53a92d2-0f1c-465f-f11c-b9bc2931f651" }, "outputs": [ { @@ -2849,56 +2888,60 @@ "execution_count": 27, "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", "metadata": { - "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "outputId": "1e63d364-3944-465a-ff7c-6e1dc750b2de" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:47:02 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 1}}\n", - "22:47:02 INFO - pipeline id pipeline_id\n", - "22:47:02 INFO - code location None\n", - "22:47:02 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "22:47:02 INFO - actor creation delay 0\n", - "22:47:02 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", - "22:47:02 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n", - "22:47:02 INFO - data factory 
data_ max_files -1, n_sample -1\n", - "22:47:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:47:02 INFO - Running locally\n", - "2024-10-16 22:47:03,977\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - orchestrator started at 2024-10-16 22:47:05\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.128299713134766, 'object_store': 3.064149856567383}\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - starting run from the beginning\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - continuing from the very beginning\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Fuzzy: num buckets 8, bucket length 8\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - created 1 bucket actors\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - created 1 minhash actors\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:05 INFO - Table preprocessing uses 1 readers\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:06 INFO - created 1 table processor actors\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:12 INFO - Completed 1 files in 0.104 min\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:12 INFO - Completed 1 files (50.0%) in 0.104 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:15 INFO - Completed processing 2 files in 0.154 min\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:15 INFO - creating minhash snapshots\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:16 INFO - minhash snapshots created\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:16 INFO - creating bucket snapshots\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:17 INFO - bucket snapshots created\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:17 INFO - created 1 document actors\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - created 1 bucket processor actors\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - created bucket processor invoker\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:18 INFO - added invoker to bucket collectors\n", - "\u001b[36m(BucketsHash pid=1008361)\u001b[0m 22:47:18 INFO - processing buckets 0 long, 53 short\n", - "\u001b[36m(BucketsHash pid=1008361)\u001b[0m 22:47:18 INFO - Done submitting long buckets\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:19 INFO - Done processing buckets in 0.012 min\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:19 INFO - creating document snapshots\n", - "\u001b[36m(BucketsHashProcessorInvoker pid=1008950)\u001b[0m 22:47:19 INFO - Waiting bucket processing completion. Submitted requests 1\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:20 INFO - document snapshots created\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:21 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:30 INFO - Completed processing 2 files in 0.153 min\n", - "\u001b[36m(orchestrate pid=1007500)\u001b[0m 22:47:30 INFO - done flushing in 0.001 sec\n", - "22:47:40 INFO - Completed execution in 0.632 min, execution result 0\n" + "13:32:00 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.8}}\n", + "13:32:00 INFO - pipeline id pipeline_id\n", + "13:32:00 INFO - code location None\n", + "13:32:00 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "13:32:00 INFO - actor creation delay 0\n", + "13:32:00 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:32:00 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/05_fuzzy_dedupe_out\n", + "13:32:00 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:32:00 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:32:00 INFO - Running locally\n", + "2024-10-18 13:32:02,246\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - orchestrator started at 2024-10-18 13:32:03\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 15.000544739887118, 'object_store': 7.500272369012237}\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - starting run from the beginning\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - continuing from the very beginning\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Fuzzy: num buckets 8, bucket length 8\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - created 1 bucket actors\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - created 1 minhash actors\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - Table preprocessing uses 1 readers\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:03 INFO - created 1 table processor actors\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:07 INFO - Completed 1 files in 0.064 min\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:07 INFO - Completed 1 files (50.0%) in 0.064 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:15 INFO - Completed processing 2 files in 0.197 min\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:15 INFO - creating minhash snapshots\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:16 INFO - minhash snapshots created\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:16 INFO - creating bucket snapshots\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - bucket snapshots created\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - created 1 document actors\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - created 1 bucket processor actors\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - created bucket processor invoker\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - added invoker to bucket collectors\n", + "\u001b[36m(BucketsHash pid=16209)\u001b[0m 13:32:17 INFO - processing buckets 0 long, 53 short\n", + "\u001b[36m(BucketsHash pid=16209)\u001b[0m 13:32:17 INFO - Done submitting long buckets\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - Done processing buckets in 0.01 min\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:17 INFO - creating document snapshots\n", + "\u001b[36m(BucketsHashProcessorInvoker pid=16602)\u001b[0m 13:32:17 INFO - Waiting bucket processing completion. Submitted requests 1\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:18 INFO - document snapshots created\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:18 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:25 INFO - Completed processing 2 files in 0.113 min\n", + "\u001b[36m(orchestrate pid=15368)\u001b[0m 13:32:25 INFO - done flushing in 0.005 sec\n", + "13:32:35 INFO - Completed execution in 0.588 min, execution result 0\n" ] }, { @@ -2906,8 +2949,8 @@ "output_type": "stream", "text": [ "✅ Stage:5 completed successfully\n", - "CPU times: user 212 ms, sys: 201 ms, total: 413 ms\n", - "Wall time: 39.4 s\n" + "CPU times: user 270 ms, sys: 200 ms, total: 470 ms\n", + "Wall time: 36.6 s\n" ] } ], @@ -2986,7 +3029,12 @@ "execution_count": 28, "id": "e899ad60", "metadata": { - "id": "e899ad60" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 677 + }, + "id": "e899ad60", + "outputId": "fcfda84c-ebbf-490f-f478-ceef7ca9e83b" }, "outputs": [ { @@ -3049,10 +3097,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -3070,10 +3118,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", @@ -3091,10 +3139,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Basic facts about Mars:\\n· Distance from the S...\n", " 
$.main-text[6]\n", " 1\n", @@ -3112,10 +3160,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", @@ -3133,10 +3181,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", @@ -3154,10 +3202,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", @@ -3188,20 +3236,20 @@ "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "3 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "3 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", "\n", " source_document_id \\\n", - "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "3 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "3 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... 
$.main-text[2] \n", @@ -3250,7 +3298,12 @@ "execution_count": 29, "id": "ab7ea52b", "metadata": { - "id": "ab7ea52b" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 238 + }, + "id": "ab7ea52b", + "outputId": "e38754ee-777f-4ed7-ebc0-9299ee122662" }, "outputs": [ { @@ -3337,7 +3390,11 @@ "execution_count": 30, "id": "6bdd3515", "metadata": { - "id": "6bdd3515" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6bdd3515", + "outputId": "e6e3f2c0-5b23-4336-bc95-013921f0724a" }, "outputs": [ { @@ -3451,7 +3508,11 @@ "execution_count": 31, "id": "20a153fa-fd56-401e-86be-4f7617affcc8", "metadata": { - "id": "20a153fa-fd56-401e-86be-4f7617affcc8" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "outputId": "530e65c6-7ceb-4c73-cb87-50da46c78add" }, "outputs": [ { @@ -3488,32 +3549,50 @@ "execution_count": 32, "id": "228df6b2-bc62-494b-9697-03ece98d7853", "metadata": { - "id": "228df6b2-bc62-494b-9697-03ece98d7853" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 914, + "referenced_widgets": [ + "8b7571c585df431eb901fcdebdf8177e", + "06107a2f48b3491f91bbe84e46e10ba0", + "bd74356eca18423aa0373c808d9097e3", + "7e13e8779a81400f996d4428c74acfaf", + "a75892696be546a3970962bae7bf732a", + "68997339f13240a4824a9e416096bee4", + "919b086abd314077bbff75687392bd91", + "b4c209371e7a403986991a786cfb296d", + "6c08de2dd9a2402c90b1a7a645db9b13", + "91fff81a1de8487c9009e872b751edb0", + "ada62d24cbcf4361acbb21808f334d33" + ] + }, + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "outputId": "b10eecc1-cd17-49c1-e3b1-b80e0e1bfa86" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "22:47:42 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", - "22:47:42 INFO - pipeline id pipeline_id\n", - "22:47:42 INFO - code location None\n", - 
"22:47:42 INFO - number of workers 2 worker options {'num_cpus': 1, 'max_restarts': -1}\n", - "22:47:42 INFO - actor creation delay 0\n", - "22:47:42 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", - "22:47:42 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", - "22:47:42 INFO - data factory data_ max_files -1, n_sample -1\n", - "22:47:42 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "22:47:42 INFO - Running locally\n", - "2024-10-16 22:47:44,003\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - orchestrator started at 2024-10-16 22:47:47\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Number of files is 2, source profile {'max_file_size': 0.009654045104980469, 'min_file_size': 0.00907135009765625, 'total_file_size': 0.01872539520263672}\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 6.101744843646884, 'object_store': 3.0508724208921194}\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:47 INFO - Number of workers - 2 with {'num_cpus': 1, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - Completed processing 2 files in 0.011 min\n", - "\u001b[36m(orchestrate pid=1009666)\u001b[0m 22:47:53 INFO - done flushing in 0.001 sec\n", - "22:48:03 INFO - Completed execution in 0.349 min, execution result 0\n" + "13:32:37 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "13:32:37 INFO - pipeline id pipeline_id\n", + "13:32:37 INFO - code location None\n", + "13:32:37 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", + "13:32:37 INFO - actor creation delay 0\n", + "13:32:37 INFO - job details {'job category': 'preprocessing', 'job name': 'text_encoder', 'job type': 'ray', 'job id': 'job_id'}\n", + "13:32:37 INFO - data factory data_ is using local data access: input_folder - output/05_fuzzy_dedupe_out output_folder - output/06_embeddings_out\n", + "13:32:37 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:32:37 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:32:37 INFO - Running locally\n", + "2024-10-18 13:32:39,609\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - orchestrator started at 2024-10-18 13:32:42\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Number of files is 2, source profile {'max_file_size': 0.009654045104980469, 'min_file_size': 0.00907135009765625, 'total_file_size': 0.01872539520263672}\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 14.943363189697266, 'object_store': 7.471681594848633}\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:42 INFO - Completed 0 files (0.0%) in 0.0 min. Waiting for completion\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:47 INFO - Completed processing 2 files in 0.087 min\n", + "\u001b[36m(orchestrate pid=17394)\u001b[0m 13:32:47 INFO - done flushing in 0.001 sec\n", + "13:32:57 INFO - Completed execution in 0.333 min, execution result 0\n" ] }, { @@ -3521,8 +3600,8 @@ "output_type": "stream", "text": [ "✅ Stage:6 completed successfully\n", - "CPU times: user 422 ms, sys: 241 ms, total: 663 ms\n", - "Wall time: 22.9 s\n" + "CPU times: user 607 ms, sys: 226 ms, total: 833 ms\n", + "Wall time: 22.1 s\n" ] } ], @@ -3578,7 +3657,12 @@ "execution_count": 33, "id": "7b1c1d09", "metadata": { - "id": "7b1c1d09" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 659 + }, + "id": "7b1c1d09", + "outputId": "70612634-b336-4ad5-ddb3-782ca0676bae" }, "outputs": [ { @@ -3641,10 +3725,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Solar System\\nOur 
solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", @@ -3663,10 +3747,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", @@ -3685,10 +3769,10 @@ " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-10-16T22:46:02.114286\n", - " 1.984612\n", + " 2024-10-18T13:30:59.490007\n", + " 2.011138\n", " mars.pdf\n", - " f20aa513-8473-4bf7-a746-a66eb28b722c\n", + " 62e5639f-f922-4ccc-a041-3cb02f1cfd83\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", @@ -3707,10 +3791,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", @@ -3729,10 +3813,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", @@ -3751,10 +3835,10 @@ " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-10-16T22:46:02.131556\n", - " 2.001925\n", + " 2024-10-18T13:30:59.494027\n", + " 2.015123\n", " earth.pdf\n", - " b4c44875-3612-4c5a-b387-2f04c63d1276\n", + " f3c0ac2e-1de2-472b-8216-2043f3b3e9d1\n", " Earth\\nBasic facts about 
Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", @@ -3786,20 +3870,20 @@ "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "1 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "2 2024-10-16T22:46:02.114286 1.984612 mars.pdf \n", - "3 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "4 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", - "5 2024-10-16T22:46:02.131556 2.001925 earth.pdf \n", + "0 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "1 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "2 2024-10-18T13:30:59.490007 2.011138 mars.pdf \n", + "3 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "4 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", + "5 2024-10-18T13:30:59.494027 2.015123 earth.pdf \n", "\n", " source_document_id \\\n", - "0 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "1 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "2 f20aa513-8473-4bf7-a746-a66eb28b722c \n", - "3 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "4 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", - "5 b4c44875-3612-4c5a-b387-2f04c63d1276 \n", + "0 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "1 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "2 62e5639f-f922-4ccc-a041-3cb02f1cfd83 \n", + "3 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "4 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", + "5 f3c0ac2e-1de2-472b-8216-2043f3b3e9d1 \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... 
$.main-text[2] \n", @@ -3865,7 +3949,11 @@ "execution_count": 34, "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", "metadata": { - "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207" + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", + "outputId": "d151e618-6528-40b5-fdbd-1c67291a7279" }, "outputs": [ { @@ -3887,7 +3975,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 31, "id": "dc0a6728", "metadata": { "id": "dc0a6728" @@ -3901,7 +3989,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "dpk-1-basic-022dev1-py312", + "display_name": "dpk-2-basic-021-py311", "language": "python", "name": "python3" }, @@ -3915,7 +4003,353 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.7" + "version": "3.11.10" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "06107a2f48b3491f91bbe84e46e10ba0": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_68997339f13240a4824a9e416096bee4", + "placeholder": "​", + "style": "IPY_MODEL_919b086abd314077bbff75687392bd91", + "value": "" + } + }, + "68997339f13240a4824a9e416096bee4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": 
null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6c08de2dd9a2402c90b1a7a645db9b13": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "7e13e8779a81400f996d4428c74acfaf": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_91fff81a1de8487c9009e872b751edb0", + "placeholder": "​", + "style": "IPY_MODEL_ada62d24cbcf4361acbb21808f334d33", + "value": " 0/0 [00:00<?, ?it/s]" + } + }, + "8b7571c585df431eb901fcdebdf8177e": { + 
"model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_06107a2f48b3491f91bbe84e46e10ba0", + "IPY_MODEL_bd74356eca18423aa0373c808d9097e3", + "IPY_MODEL_7e13e8779a81400f996d4428c74acfaf" + ], + "layout": "IPY_MODEL_a75892696be546a3970962bae7bf732a" + } + }, + "919b086abd314077bbff75687392bd91": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "91fff81a1de8487c9009e872b751edb0": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": 
null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a75892696be546a3970962bae7bf732a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "ada62d24cbcf4361acbb21808f334d33": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": 
"1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "b4c209371e7a403986991a786cfb296d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": "20px" + } + }, + "bd74356eca18423aa0373c808d9097e3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_b4c209371e7a403986991a786cfb296d", + "max": 1, + "min": 0, + "orientation": "horizontal", + "style": 
"IPY_MODEL_6c08de2dd9a2402c90b1a7a645db9b13", + "value": 0 + } + } + } } }, "nbformat": 4, From 27e7134ef20f19c1ed0132820a3a187c5f21b229 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 21 Oct 2024 08:02:01 -0700 Subject: [PATCH 06/10] Update examples/notebooks/intro/README.md Co-authored-by: Maroun Touma --- examples/notebooks/intro/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md index 14d56e8e9..30c7a7b24 100644 --- a/examples/notebooks/intro/README.md +++ b/examples/notebooks/intro/README.md @@ -14,6 +14,7 @@ conda create -n data-prep-kit -y python=3.11 conda activate data-prep-kit # install the following in 'data-prep-kit' environment +pip3 install data-prep-toolkit==0.2.1 pip3 install data-prep-toolkit-transforms==0.2.1 data-prep-toolkit-transforms-ray==0.2.1 pip3 install jupyterlab ipykernel ipywidgets From 71e0dc2bb4d40a87f8b09900edbc31dcb575a24b Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 21 Oct 2024 08:48:13 -0700 Subject: [PATCH 07/10] Update README.md pip install in 2 lines --- examples/notebooks/intro/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md index 30c7a7b24..4a45cbbad 100644 --- a/examples/notebooks/intro/README.md +++ b/examples/notebooks/intro/README.md @@ -15,7 +15,8 @@ conda activate data-prep-kit # install the following in 'data-prep-kit' environment pip3 install data-prep-toolkit==0.2.1 -pip3 install data-prep-toolkit-transforms==0.2.1 data-prep-toolkit-transforms-ray==0.2.1 +pip3 install data-prep-toolkit-transforms==0.2.1 +pip3 install data-prep-toolkit-transforms-ray==0.2.1 pip3 install jupyterlab ipykernel ipywidgets ## install custom kernel From b3acad2eef97cb110f6b049af9f025614ce01921 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 21 Oct 2024 08:50:09 -0700 Subject: [PATCH 08/10] Update dpk_intro_1_python.ipynb 
Python only needs data-prep-toolkit --- examples/notebooks/intro/dpk_intro_1_python.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb index 91bb79060..f3659afcf 100644 --- a/examples/notebooks/intro/dpk_intro_1_python.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_python.ipynb @@ -149,8 +149,8 @@ "source": [ "if RUNNING_IN_COLAB:\n", " ! pip install --default-timeout=100 \\\n", + " data-prep-toolkit==0.2.1 \\\n", " data-prep-toolkit-transforms==0.2.1 \\\n", - " data-prep-toolkit-transforms-ray==0.2.1 \\\n", " deepsearch-toolkit\n" ] }, From b236dc062a6d7c6db0f5d0e73fb3c32348c25ae9 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 21 Oct 2024 08:52:02 -0700 Subject: [PATCH 09/10] Update dpk_intro_1_ray.ipynb We still need data-prep-toolkit, and the ray version of transforms --- examples/notebooks/intro/dpk_intro_1_ray.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index 04af8ecd9..5bf90522f 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -150,7 +150,7 @@ "source": [ "if RUNNING_IN_COLAB:\n", " ! 
pip install --default-timeout=100 \\\n", - " data-prep-toolkit-transforms==0.2.1 \\\n", + " data-prep-toolkit==0.2.1 \\\n", " data-prep-toolkit-transforms-ray==0.2.1 \\\n", " deepsearch-toolkit" ] From 4d070ca8ec30af78ad3c9da9983ed8a8759cb758 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 21 Oct 2024 09:44:48 -0700 Subject: [PATCH 10/10] Update dpk_intro_1_ray.ipynb We need transforms only for ray version --- examples/notebooks/intro/dpk_intro_1_ray.ipynb | 1 + 1 file changed, 1 insertion(+) diff --git a/examples/notebooks/intro/dpk_intro_1_ray.ipynb b/examples/notebooks/intro/dpk_intro_1_ray.ipynb index 5bf90522f..da33a3499 100644 --- a/examples/notebooks/intro/dpk_intro_1_ray.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_ray.ipynb @@ -151,6 +151,7 @@ "if RUNNING_IN_COLAB:\n", " ! pip install --default-timeout=100 \\\n", " data-prep-toolkit==0.2.1 \\\n", + " data-prep-toolkit-transforms==0.2.1 \\\n", " data-prep-toolkit-transforms-ray==0.2.1 \\\n", " deepsearch-toolkit" ]

z<+yevc;zdZc}rBalhbJ1q{QV}8raJ0x2*kNs0#USRE7M1Q8f=q)ibuHPqGKJQO{RP z!s*DLsp8FR+Sbm&uoX#J{X-c({iQ1T+(9ws=-Ktq^-#yGmI!s0#rIc+i|h?kyM37< z1GDuyH_l$T<>kdWKdAKmTHMNA4tTs8)MD00g&M0HI3QDnN5haWO~2D03S5Zxyi`}0 z8veuwxJ+I|+`V-kvhgE}y0kZbek$j-XYI988_bGiDUKg}# z!rFRaw%1+mr#53$a8tBBa8mCUn*;x*TQq&(ExG*oaJe#hKTj>f$9W`;5Tl@+p295? zd{q}kS@)2RlcZZQSyis(zc>H5CEv+f(HPcvsc&$n^)=6%g#Gzv1sPN|I(+(A2f(TS zC<=XcZI5uQ2h|K!uQvddzMx+Kl^&CRgaF7&#;93TAJWra^(1dC0Xq^5&SCwAkta2l zre;lC7N?p_^-k7+whb&Kb*F|*FuvP(Q~j!ojuS7qmR6tr<5)f3z*9at{dVNQN^}=U zV9Omn1Ngm-b-S>QyeBNn+3vXwKOVxReUA{a;f|_bIrG(uZdh{RTS$O|ER_OHs87^N z*y=3-4_SE@gFBEzr02Y#WQ1!<=dmM;?_!3`owf;wl=c99+N4c^$Lu`5-|nL{xrc32 zEj4BLu>cOaxZX=HsRx^O_viA4l&5rZWO=?xAST7`2XS#G!_kI~V4b$)Qe)g9ojx*Z zD2Z9I(nC6#7hMUtxW(|TS!sIpAF}(5iwWv}9Y5T+{Ci3)$rsE!Z!I2JPjvjmyr)ye zoQDO}yw@{aq!I=M==-6WA-?RFzj8_0g!81?gnQclLCR8Sm&Lvuo3NPyn{bB`+h6WV z9cg-p__7I$BS^Bw8cDKN(@C<5TCQ;)km6x6@9qX9L49e31C2sQM%6+`5|HP+67yzi zxt70=RKob_^fD(|>|UBP_&UZ%JM>XK{*goud?Uj8HToKvBLdcSoU`Umwc0+$;2h*p zUc4EUON1p2=17Xs%l;9W*9zFV%yFh+_MK^r!P(`bij#b?XfvS^J|_O^9^E>2QR=@C zmG3`^3IPKD*F<$uSFrzgqS_JyHHfj!Xg9yvPN|nP7p}nL2tz+o-5$-@hwzkGd$oB| zFAWNb_$}PKFAa}3csW-e8-|lY;M*-u)%)9rNT0f4;3C}M>J2Yk^B!MmQxY(KZ_P`(&^;rH3)NWrThJEq>prPX3R`@<)*ihLR?I#EB{?m&c32fi?S zO9(pSAE$>=_(7oZm8vWa%bdVowDutWAhW2cZ(P0oTtk+67{n4c;iZdO)I?7^OjqT|G2qj4?a8R?@l+1-nA57NP`+qnycI}W?n@H3IR?TOzj0Eu4X=9jhecWFzX zdKSmRZXj$A=FPqt8KXQ}rDeO;kPO$vJ-35K98J~ou448Xf7l-9qbc%@xzZ&5B=96K zff|#o`goT8A-g~dFhz}E<@vf}&HA%a=V!2_HI#1=B;a^ZWwL?)2orEx>$q;Q^+udy;>z&&!2u6qglc%#z zMr-C&b#45=a+TGrx`FRdywxd(qrpXZm9=oyC<UKpGuGb*G;ME zwb{lLB(L0;!slY)m|Vs&frZ&LCcC3E=q8d}*l^T#mLBUBQ?sY}{zWJGX|NCSlR(yK zW9;r@y444nxODbrC|Pu+PylwrBVbYTJ~( z?$wQMW&_e^q)!AaCUg1t71An^OX6YGq=_3V=D7z%8AjCIHv}dps`(HwV(R$Y>pEXKrt5jSc#@ zTcPgZ00nR=nOdvCZLv8O0o;(kvA&Zt901})R^_2~7H~@d7Z?QkXSek~KM6Ez={V2x z;``22wB;>EKU-t60HvO&)TZiI_PEaRL{nRaCGjUpB<2&|Enn$4BT_MR83rx0O+RFq zE1Q~i<(q0+SsHVM2n=Kdv;^^EdU5?B1ielO^TK{@gWIK?#e(T9FKZc3w)-@KjN(Pe 
z7h-B1mH>*??DriN#E)1usB?>f&6HV?{d9g;a(d!dXn+CZQe!M`>>F?DsXX7PsHLXk=lH#OI zl-Av&i(`5OPo~Pu)`>=iC%D=NSd$h(XrqYYcup0RRXL&sY1zV*@BLDU)5 zOWlKBjv`KdGT^a;iB4w%R)bVFjRf=t!k@vNJllre89s$2BXw>zcR zH6fbh8{8_N?k6fEDwRlUO7;`il`uNRY=*hq%qNQu9lhy?`0oh|-Y|9!r%8X}Dr>%A zu7lSz6Z=b5t*4}Lyg7aES*~&1!Nf&ZB}GU#7MkhMA1(u=e@2l1raVV|*0p)lx|^4HL(5ly2-Jn^1F!7^Y+s* zPhR1TD2W(Vphev!-@w5G#iXW^dZWp#!~QRcOBTYxOVK<_o#d_f`HRsiVFzR}*IXj2 z2=6u#z>>3XJ14;fc&%l#&Ry^{B<7|*BN1oxMrfUgp%b>|c3-~Y=Du;0uzYscv+2{a zg8n$a4fBsojL6_be%*Pc<2w_ivI$N0&`2QqGg54G+IKL>=@$Pn8$Q#8rn=L~kbxADbf2!bei0CZ~=TV>wmP=i@P+ z#o{*VcC>zX4i@=OB7D_PUR82D<*LGMWj!J*Pd3EQ-zC74dSl&^1hQ@`N6s=wZ2ak5 z^sEeWDd2k1drgr`Nf^+1gI90MbkzrTzn`d?=E_G`3tYYJi1-9Y%$|@1YuYQTyNH8B z$h_{^Vaz*Uz3xX$#}{TM4?HcmS!#cLaWwGNC0;S{Iygh@fklk?&AiIfe6)gn`lhaX z${QYMG-}t`hIEZi-dApyc&))sC$+3sVY{nZlVMlaM_u`L^Zm_Y{6^qvQ&Hn1JRc4B zhm9f6@!!9ZFW3+oAqN}M=Oi1_2HVTB2I@Bu@7;ZOQidsARUx*UxLZ}LFL9BA@8kAm zis5XC%rSh462$XdPQfU67(FW{Vzb_F%3Q;_*W2z_yHLrlc#;j154>IM(+y!!K{+3M zN-)Z)p-1eB&AgXigl$W*+QN>Kf6fioSW8kkrJwud_QEVmYb93L;0kU(upcn9E6%uc zT$bHcvh=nDB^Wo~cdJ0C?^rW&!tiU$A)14M$=+LR0_!Q;6&{-0=o^HCWL@_?j(5HV4(lqg7yAHpLJ zk>KS9gG52Xfd5$q*~-7X8rT1ReCyQ^KOw3LO3>;d;)wC*^BrFY~ld1OdpH+5`w5Qyn(b~vl`SF;fQwB9i>N;Y_`X&9~hqih_JNke+$GXCPz0UpbLn${U^;skehLK+hc0Ry?Q L85t#&q_F=FB4uQ; literal 0 HcmV?d00001 diff --git a/examples/notebooks/intro/my_utils.py b/examples/notebooks/intro/my_utils.py new file mode 100644 index 000000000..9a6477dfc --- /dev/null +++ b/examples/notebooks/intro/my_utils.py @@ -0,0 +1,55 @@ +import os +import requests +from humanfriendly import format_size +import pandas as pd +import glob + + +## Reads parquet files in a folder into a pandas dataframe +def read_parquet_files_as_df (parquet_dir): + parquet_files = glob.glob(f'{parquet_dir}/*.parquet') + + # read each parquet file into a DataFrame and store in a list + dfs = [pd.read_parquet (f) for f in parquet_files] + + # Concatenate all DataFrames into a single DataFrame + data_df = pd.concat(dfs, ignore_index=True) + return data_df 
+ + +def download_file(url, local_file, chunk_size=1024*1024): + """ + Downloads a remote URL to a local file. + + Args: + url (str): The remote URL. + local_file (str): The path of the local file to save the downloaded content. + chunk_size (int): The size in bytes of each chunk. Defaults to 1024*1024 (1 MB). + + Returns: + None + + Example usage: + download_file('http://example.com/file.txt', 'file.txt', chunk_size=1024*1024) # Download in chunks of 1MB + """ + # Check if the local file already exists + if os.path.exists(local_file): + file_size = format_size(os.path.getsize(local_file)) + print(f"Local file '{local_file}' ({file_size}) already exists. Skipping download.") + return + + # Create the parent directory if it doesn't exist ('.' when local_file has no directory part) + os.makedirs(os.path.dirname(local_file) or '.', exist_ok=True) + + # Stream the file download + with requests.get(url, stream=True) as r: + r.raise_for_status() + with open(local_file, 'wb') as f: + for chunk in r.iter_content(chunk_size=chunk_size): + if chunk: # filter out keep-alive new chunks + f.write(chunk) + print() + file_size = format_size(os.path.getsize(local_file)) + print(f"{local_file} ({file_size}) downloaded successfully.") +## --- end: download_file ------ + From 41e1d525b876868a4712dbe4e6af94d844914b6c Mon Sep 17 00:00:00 2001 From: Sujee Maniyam Date: Wed, 16 Oct 2024 22:56:31 -0700 Subject: [PATCH 02/10] DPK intro example v2 Signed-off-by: Sujee Maniyam --- examples/notebooks/intro/.gitignore | 10 + examples/notebooks/intro/README.md | 5 + .../notebooks/intro/dpk_intro_1_python.ipynb | 1919 ++++++----------- .../notebooks/intro/dpk_intro_1_ray.ipynb | 1392 ++++++------ 4 files changed, 1409 insertions(+), 1917 deletions(-) create mode 100644 examples/notebooks/intro/.gitignore diff --git a/examples/notebooks/intro/.gitignore b/examples/notebooks/intro/.gitignore new file mode 100644 index 000000000..89b9e565b --- /dev/null +++ b/examples/notebooks/intro/.gitignore @@ -0,0 +1,10 @@ +output*/ + +## File system artifacts +.directory
+.DS_Store + + +## Python output +__pycache__ +.ipynb_checkpoints/ \ No newline at end of file diff --git a/examples/notebooks/intro/README.md b/examples/notebooks/intro/README.md index 53d21433c..07b63f513 100644 --- a/examples/notebooks/intro/README.md +++ b/examples/notebooks/intro/README.md @@ -4,6 +4,11 @@ This is an example featuring some of the features of data prep kit. ## Running the code +The code can be run in either of two ways: + +1. Google Colab: very easy to run; no local setup needed. +2. Your local Python environment: follow the [instructions](../../../README.md#-getting-started) to set it up. + ## Intro This notebook will demonstrate processing PDFs diff --git a/examples/notebooks/intro/dpk_intro_1_python.ipynb b/examples/notebooks/intro/dpk_intro_1_python.ipynb index 6f4cf757e..1049bf8d6 100644 --- a/examples/notebooks/intro/dpk_intro_1_python.ipynb +++ b/examples/notebooks/intro/dpk_intro_1_python.ipynb @@ -13,7 +13,7 @@ "\n", "Here is the workflow\n", "\n", - "![](https://raw.githubusercontent.com/IBM/data-prep-kit/dev/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" + "![](https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/images/data-prep-kit-3-workflow.png)\n" ] }, { @@ -27,7 +27,7 @@ "\n", "Two options:\n", "\n", - "- **Option 1 - Google Colab:** easiest option. no setup required. Click this link to open this on google colab. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/main/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", + "- **Option 1 - Google Colab:** Easiest option; no setup required. Click this link to open it in Google Colab.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit/blob/intro-example1/examples/notebooks/intro/dpk_intro_1_python.ipynb)\n", "- **Option 2 - Local python dev environment:** Setup using this [guide](../../../README.md#-getting-started)\n", "\n", "The notebook will work as in both environments" @@ -45,7 +45,7 @@ "We will use simple PDFs about Solar system. The files are [here](https://github.com/sujee/data-prep-kit/tree/main/examples/notebooks/intro/input/solar-system)\n", "\n", "- [earth.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/earth.pdf)\n", - "- [mars.pdf](https://github.com/sujee/data-prep-kit-examples/blob/main/data/solar-system/mars.pdf)\n" + "- [mars.pdf](https://github.com/sujee/data-prep-kit/blob/main/examples/notebooks/intro/input/solar-system/mars.pdf)\n" ] }, { @@ -71,7 +71,7 @@ "base_uri": "https://localhost:8080/" }, "id": "1fe354b7", - "outputId": "0a38a7b5-238e-433a-c378-78444908aa8a" + "outputId": "5c153f72-08ed-4d6e-ccc7-dae851e7fd8b" }, "outputs": [ { @@ -112,15 +112,15 @@ "base_uri": "https://localhost:8080/" }, "id": "3309799e", - "outputId": "9b44b764-d284-4da1-ad55-f08d5c9c0f89" + "outputId": "99530315-6dd5-405d-dbde-61e2332e441b" }, "outputs": [], "source": [ "if RUNNING_IN_COLAB:\n", - " !mkdir -p 'input'\n", - " !wget -O 'input/earth.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/earth.pdf'\n", - " !wget -O 'input/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/input/solar-system/mars.pdf'\n", - " !wget -O 'utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/main/examples/notebooks/intro/my_utils.py'" + " !mkdir -p 'input/solar-system'\n", + " !wget -O 'input/solar-system/earth.pdf'
'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/earth.pdf'\n", + " !wget -O 'input/solar-system/mars.pdf' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/input/solar-system/mars.pdf'\n", + " !wget -O 'my_utils.py' 'https://raw.githubusercontent.com/sujee/data-prep-kit/intro-example1/examples/notebooks/intro/my_utils.py'" ] }, { @@ -138,7 +138,12 @@ "execution_count": 3, "id": "1fcec577", "metadata": { - "id": "1fcec577" + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "1fcec577", + "outputId": "0f77fc39-ffeb-48da-ce6f-1750d8d3ad62" }, "outputs": [], "source": [ @@ -146,8 +151,7 @@ " ! pip install --default-timeout=100 \\\n", " data-prep-toolkit[ray]==0.2.2.dev1 \\\n", " data-prep-toolkit-transforms[ray,all]==0.2.2.dev1 \\\n", - " deepsearch-toolkit\n", - " " + " deepsearch-toolkit\n" ] }, { @@ -195,7 +199,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e4YMZrBuFycl", - "outputId": "42a9edae-205f-4dce-cd4e-a159bd8f620b" + "outputId": "d7ee9449-4f21-4c9a-fa54-14b7f28d764a" }, "outputs": [ { @@ -222,23 +226,9 @@ "execution_count": 5, "id": "33345487", "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "33345487", - "outputId": "79b40d76-b4dd-48ea-9638-461c78a637a1" + "id": "33345487" }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "MY_CONFIG.RAY_RUNTIME_WORKERS: 2\n", - "MY_CONFIG.RAY_NUM_CPUS: 0.8\n", - "MY_CONFIG.RAY_MEMORY_GB: 2\n" - ] - } - ], + "outputs": [], "source": [ "import os\n", "\n", @@ -248,36 +238,13 @@ "\n", "MY_CONFIG = MyConfig ()\n", "\n", - "if RUNNING_IN_COLAB:\n", - " MY_CONFIG.INPUT_DATA_DIR = 'input'\n", - "else:\n", - " MY_CONFIG.INPUT_DATA_DIR = os.path.join (os.path.abspath (''), '..', 'data', 'solar-system')\n", - " \n", + "MY_CONFIG.INPUT_DATA_DIR = 'input/solar-system'\n", + "\n", "MY_CONFIG.OUTPUT_FOLDER = \"output\"\n", 
"MY_CONFIG.OUTPUT_FOLDER_FINAL = os.path.join(MY_CONFIG.OUTPUT_FOLDER , \"output_final\")\n", "\n", "## Embedding model\n", - "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'\n", - "\n", - "## RAY CONFIGURATION\n", - "### For local runs, we can use more parallelism\n", - "### For google colab, be conservative\n", - "\n", - "if RUNNING_IN_COLAB:\n", - " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", - " MY_CONFIG.RAY_NUM_CPUS = 0.3\n", - " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", - "else: # local run\n", - " num_cpus_available = os.cpu_count()\n", - " # print (num_cpus_available)\n", - " MY_CONFIG.RAY_NUM_CPUS = 0.8\n", - " MY_CONFIG.RAY_MEMORY_GB = 2 # GB\n", - " # MY_CONFIG.RAY_RUNTIME_WORKERS = num_cpus_available // 3\n", - " MY_CONFIG.RAY_RUNTIME_WORKERS = 2\n", - "\n", - "print ('MY_CONFIG.RAY_RUNTIME_WORKERS:', MY_CONFIG.RAY_RUNTIME_WORKERS)\n", - "print ('MY_CONFIG.RAY_NUM_CPUS:', MY_CONFIG.RAY_NUM_CPUS)\n", - "print ('MY_CONFIG.RAY_MEMORY_GB:', MY_CONFIG.RAY_MEMORY_GB)\n" + "MY_CONFIG.EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'" ] }, { @@ -316,7 +283,7 @@ "base_uri": "https://localhost:8080/" }, "id": "60ac8bee-0960-4309-b225-d7a211b14262", - "outputId": "5c305d54-1c91-455d-d0e2-b514b61a068b" + "outputId": "4d5511fb-1c6f-47df-e5ea-2c1b354d262f" }, "outputs": [ { @@ -338,8 +305,7 @@ "output_chunk_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '02_chunk_out')\n", "output_docid_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '03_docid_out')\n", "output_exact_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '04_exact_dedupe_out')\n", - "output_fuzzy_dedupe_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_fuzzy_dedupe_out')\n", - "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '06_embeddings_out')\n", + "output_embeddings_dir = os.path.join (MY_CONFIG.OUTPUT_FOLDER, '05_embeddings_out')\n", "\n", "## clear output folder\n", "shutil.rmtree(MY_CONFIG.OUTPUT_FOLDER, ignore_errors=True)\n", @@ -381,14 +347,14 @@ "base_uri": 
"https://localhost:8080/" }, "id": "482605b2-d814-456d-9195-49a2ec454ef0", - "outputId": "90eb1f89-35d1-4b6f-ea34-7667680dd256" + "outputId": "c50847d4-f2c7-4559-f5f7-d6a3d025027d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "🏃🏼 STAGE-1: Processing input='/home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system' --> output='output/01_parquet_out'\n" + "🏃🏼 STAGE-1: Processing input='input/solar-system' --> output='output/01_parquet_out'\n" ] } ], @@ -418,49 +384,49 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 625, + "height": 657, "referenced_widgets": [ - "8226b2522ce446f6bd3a36c4e227370c", - "7616f1b493e1461c9fd1319fae3bc10b", - "4f63bfad92b64e7bae18e720376d402d", - "6957a659451b46dab702c1c62fa9cdd2", - "2eea7bc810e54eaeb325136352b71e66", - "ebc626c0750c470db6789b26acf15f60", - "3077f04af3a9447ab98717bd3131cd8f", - "709685da1c6c4164bed658357a2191bf", - "0a1ed94698ca4e4291c553929e0ca66c", - "5dbc6889a9c243c5a922f8cc5f1a704c", - "d6e520e4da004c818031ccfcc3588e5d" + "97b603697cfa4b4ea4e6735b6768ca35", + "e87e8d3262c54cfaaa8768505edacda3", + "b78aa40816e44f7fbebcb24ca68818b3", + "7053c9606a414e978636a7e241909504", + "da0787b239764847a731083997780a85", + "553f3c16839a49d79591d0fc4862bed6", + "c0eb5bc8f6ee427ca42204b3c56f9a4e", + "9d184ed175f0403fb03c2e13dfd04e0a", + "724778729161445c98b187031ae4f67c", + "1cb3bbf7d724411cbe9831543a4aecc0", + "06f9b33494984e4885d5aad813d1d2bc" ] }, "id": "b0cd8ebd-bf71-42d6-a397-8df0c7b66a26", - "outputId": "e2c85b44-f605-4817-c120-2cdce79e3c84" + "outputId": "01d207fb-983d-40b2-e5f6-e38e3789110a" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:40:02 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", - "18:40:02 INFO - pipeline id pipeline_id\n", - "18:40:02 INFO - code location None\n", - "18:40:02 INFO - data 
factory data_ is using local data access: input_folder - /home/sujee/my-stuff/projects/ai-alliance/data-prep-kit-examples/dpk-intro/../data/solar-system output_folder - output/01_parquet_out\n", - "18:40:02 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", - "18:40:02 INFO - orchestrator pdf2parquet started at 2024-09-18 18:40:02\n", - "18:40:02 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", - "18:40:02 INFO - Initializing models\n" + "22:43:02 INFO - pdf2parquet parameters are : {'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'double_precision': 8}\n", + "22:43:02 INFO - pipeline id pipeline_id\n", + "22:43:02 INFO - code location None\n", + "22:43:02 INFO - data factory data_ is using local data access: input_folder - input/solar-system output_folder - output/01_parquet_out\n", + "22:43:02 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:43:02 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']\n", + "22:43:02 INFO - orchestrator pdf2parquet started at 2024-10-16 22:43:02\n", + "22:43:02 INFO - Number of files is 2, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0551910400390625, 'total_file_size': 0.11101436614990234}\n", + "22:43:02 INFO - Initializing models\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "6454e0eb538145aebeed98e2ec662b22", + "model_id": "e92bbc86f5e34ee4ad7dd853a5136c01", "version_major": 2, "version_minor": 0 }, "text/plain": [ - "Fetching 7 files: 0%| | 0/7 [00:001\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", + " 
07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", " \n", " \n", @@ -622,12 +588,12 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", " \n", " \n", @@ -640,16 +606,16 @@ "1 earth.pdf {\"_name\":\"\",\"type\":\"pdf-document\",\"description... 1 \n", "\n", " num_tables num_doc_elements document_id ext \\\n", - "0 0 11 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 0 11 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + "0 0 11 07bc0c9a-f863-48e3-9aed-bd289af040bc pdf \n", + "1 0 11 e141f7a4-3e45-4f04-88d3-60e0a81b195b pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:06.831334 0.857239 earth.pdf " + "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "1 2024-10-16T22:43:07.205350 0.921915 earth.pdf " ] }, "execution_count": 10, @@ -700,7 +666,7 @@ "base_uri": "https://localhost:8080/" }, "id": "f870e624", - "outputId": "f70bfa9f-62f8-417d-d91a-30c1f024ccbd" + "outputId": "0b4c054f-3a8a-4db3-f32f-17bd1466b102" }, "outputs": [ { @@ -852,7 +818,7 @@ "base_uri": "https://localhost:8080/" }, "id": "e1a10c2d", - "outputId": "300e7688-692a-4039-c4a4-a86887d9138b" + "outputId": "c1d992c2-faa8-40cd-c375-857970201daa" }, "outputs": [ { @@ -1026,7 +992,7 @@ "base_uri": "https://localhost:8080/" }, "id": "305f00a3", - "outputId": "a787385b-214a-41b2-975d-0d3c5529c2c4" + "outputId": "dd511f34-bab3-4dde-d938-493debb02e5e" }, "outputs": [ { @@ -1067,26 +1033,26 @@ "base_uri": "https://localhost:8080/" }, "id": "5b7b18d5", - "outputId": "cb338503-3dca-45bd-a60a-bd214843a97b" + "outputId": "e0b87171-9d66-473f-e66a-e4b6ae3c3f66" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:40:09 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'output_chunk_column_name': 'contents', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox'}\n", - "18:40:09 INFO - pipeline id pipeline_id\n", - "18:40:09 INFO - code location None\n", - "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", - "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:40:09 INFO - orchestrator doc_chunk started at 2024-09-18 
18:40:09\n", - "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", - "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "18:40:09 INFO - done flushing in 0.0 sec\n", - "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + "22:43:09 INFO - doc_chunk parameters are : {'chunking_type': , 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'dl_min_chunk_len': None, 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30}\n", + "22:43:09 INFO - pipeline id pipeline_id\n", + "22:43:09 INFO - code location None\n", + "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/01_parquet_out output_folder - output/02_chunk_out\n", + "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:43:09 INFO - orchestrator doc_chunk started at 2024-10-16 22:43:09\n", + "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.02239513397216797, 'min_file_size': 0.02167987823486328, 'total_file_size': 0.04407501220703125}\n", + "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "22:43:09 INFO - done flushing in 0.0 sec\n", + "22:43:09 INFO - Completed execution in 0.0 min, execution 
result 0\n" ] }, { @@ -1094,8 +1060,8 @@ "output_type": "stream", "text": [ "✅ Stage:2 completed successfully\n", - "CPU times: user 861 ms, sys: 140 ms, total: 1 s\n", - "Wall time: 1.21 s\n" + "CPU times: user 1.07 s, sys: 180 ms, total: 1.25 s\n", + "Wall time: 1.55 s\n" ] } ], @@ -1151,10 +1117,10 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 893 + "height": 897 }, "id": "d8138d43", - "outputId": "0d08e0a6-e743-44d9-b8f1-eec98b222a92" + "outputId": "fd01e0cb-899e-4c73-d50e-5f4e6f5ff802" }, "outputs": [ { @@ -1164,7 +1130,7 @@ "Files processed : 2\n", "Chunks created : 8\n", "Input data dimensions (rows x columns)= (2, 12)\n", - "Output data dimensions (rows x columns)= (8, 15)\n" + "Output data dimensions (rows x columns)= (8, 16)\n" ] }, { @@ -1192,17 +1158,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " \n", " \n", " \n", @@ -1212,17 +1179,18 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " \n", " \n", " 1\n", @@ -1230,17 +1198,18 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 
07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " \n", " \n", " 2\n", @@ -1248,17 +1217,18 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", " \n", " \n", " 3\n", @@ -1266,17 +1236,18 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", " \n", " \n", " 4\n", @@ -1284,17 +1255,18 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " \n", " \n", " 5\n", @@ -1302,17 +1274,18 @@ " 1\n", " 0\n", " 11\n", - " 
e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " \n", " \n", " 6\n", @@ -1320,17 +1293,18 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", " \n", " \n", " 7\n", @@ -1338,42 +1312,33 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", " \n", " \n", "\n", "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "7 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - 
"1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "3 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "7 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -1386,14 +1351,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "3 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "7 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "3 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "7 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "3 
07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "7 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -1405,15 +1380,25 @@ "6 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", "7 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", "\n", - " page_number bbox \n", - "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", - "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", - "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", - "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", - "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", - "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", - "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", - "7 1 [133.30151367, 494.86206055, 240.17156982, 505... " + " page_number bbox \\\n", + "0 1 [132.84518433, 588.96014404, 479.40917969, 623... \n", + "1 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "2 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "3 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "4 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "5 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 
\n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... " ] }, "execution_count": 15, @@ -1461,7 +1446,7 @@ "height": 300 }, "id": "3090c950", - "outputId": "cf9bd956-7b31-42bc-ef77-9ebded8ba08e" + "outputId": "0f4b6771-8d38-4a27-c756-21f916b23a4f" }, "outputs": [ { @@ -1564,7 +1549,7 @@ "base_uri": "https://localhost:8080/" }, "id": "d5f151ae", - "outputId": "2b48675c-328d-4d24-d689-ad77231ef4b7" + "outputId": "a4c491b2-53db-4d71-da24-4479de8d1d65" }, "outputs": [ { @@ -1624,7 +1609,9 @@ { "cell_type": "markdown", "id": "7ad1c60d", - "metadata": {}, + "metadata": { + "id": "7ad1c60d" + }, "source": [ "## Step-5: DOC ID generation of Chunks\n", "\n", @@ -1639,7 +1626,9 @@ { "cell_type": "markdown", "id": "1afaa0fd", - "metadata": {}, + "metadata": { + "id": "1afaa0fd" + }, "source": [ "### 5.1 - Set Input/output Folder" ] @@ -1648,7 +1637,13 @@ "cell_type": "code", "execution_count": 18, "id": "6ffd6f54", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6ffd6f54", + "outputId": "1784c80d-6309-4913-9f55-c018b978968f" + }, "outputs": [ { "name": "stdout", @@ -1676,7 +1671,9 @@ { "cell_type": "markdown", "id": "f78a51b7", - "metadata": {}, + "metadata": { + "id": "f78a51b7" + }, "source": [ "### 5.2 - Execute" ] @@ -1685,25 +1682,31 @@ "cell_type": "code", "execution_count": 19, "id": "5fc77557", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "5fc77557", + "outputId": "db2b8670-543e-4073-9c7d-3f9ef5f4317e" + }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:40:09 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", - "18:40:09 INFO - pipeline id pipeline_id\n", - "18:40:09 INFO - code location None\n", - "18:40:09 INFO - data factory data_ is using local data access: input_folder - 
output/02_chunk_out output_folder - output/03_docid_out\n", - "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:40:09 INFO - orchestrator doc_id started at 2024-09-18 18:40:09\n", - "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.008135795593261719, 'min_file_size': 0.008058547973632812, 'total_file_size': 0.01619434356689453}\n", - "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "18:40:09 INFO - done flushing in 0.0 sec\n", - "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + "22:43:09 INFO - Doc id parameters are : {'doc_column': 'contents', 'hash_column': 'chunk_hash', 'int_column': 'chunk_id', 'start_id': 0}\n", + "22:43:09 INFO - pipeline id pipeline_id\n", + "22:43:09 INFO - code location None\n", + "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/02_chunk_out output_folder - output/03_docid_out\n", + "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:43:09 INFO - orchestrator doc_id started at 2024-10-16 22:43:09\n", + "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.008975982666015625, 'min_file_size': 0.008897781372070312, 'total_file_size': 0.017873764038085938}\n", + "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "22:43:09 INFO - done flushing in 0.0 
sec\n", + "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" ] }, { @@ -1711,8 +1714,8 @@ "output_type": "stream", "text": [ "✅ Stage:3 completed successfully\n", - "CPU times: user 19.2 ms, sys: 603 μs, total: 19.8 ms\n", - "Wall time: 16.2 ms\n" + "CPU times: user 10.1 ms, sys: 3 ms, total: 13.1 ms\n", + "Wall time: 11.3 ms\n" ] } ], @@ -1752,7 +1755,9 @@ { "cell_type": "markdown", "id": "a9a8c1fa", - "metadata": {}, + "metadata": { + "id": "a9a8c1fa" + }, "source": [ "### 5.3 - Inspect Generated output\n", "\n", @@ -1768,14 +1773,21 @@ "cell_type": "code", "execution_count": 20, "id": "da9adede", - "metadata": {}, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 860 + }, + "id": "da9adede", + "outputId": "036db4ca-12f6-4b3e-9d7f-fa70e494870d" + }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (8, 15)\n", - "Output data dimensions (rows x columns)= (8, 17)\n" + "Input data dimensions (rows x columns)= (8, 16)\n", + "Output data dimensions (rows x columns)= (8, 18)\n" ] }, { @@ -1803,17 +1815,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_hash\n", " chunk_id\n", " \n", @@ -1825,18 +1838,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.84518433, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 
44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 4\n", " \n", " \n", @@ -1845,18 +1859,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " 5\n", " \n", " \n", @@ -1865,18 +1880,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", " 6\n", " \n", " \n", @@ -1885,18 +1901,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", " 7\n", " \n", " \n", @@ -1905,18 +1922,19 @@ " 1\n", " 0\n", " 11\n", - " 
e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 0\n", " \n", " \n", @@ -1925,18 +1943,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " 1\n", " \n", " \n", @@ -1945,18 +1964,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", " 2\n", " \n", " \n", @@ -1965,18 +1985,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 
0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", " 3\n", " \n", " \n", @@ -1984,25 +2005,15 @@ "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 mars.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "7 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "3 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "7 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 mars.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", + "7 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -2015,14 +2026,24 @@ "7 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "3 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "7 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "3 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "7 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "3 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "7 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", @@ -2044,6 +2065,16 @@ "6 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", "7 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", + " document_id \\\n", + "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "2 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "3 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "4 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 
\n", + "5 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "6 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "7 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", " chunk_hash chunk_id \n", "0 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 4 \n", "1 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", @@ -2101,7 +2132,7 @@ "base_uri": "https://localhost:8080/" }, "id": "4c7a1b94", - "outputId": "2a135853-c54f-4aa4-ffc4-83c2bc7a68ce" + "outputId": "2f6f05bc-f6fd-4d66-ea01-ed89cd5b80f3" }, "outputs": [ { @@ -2142,27 +2173,27 @@ "base_uri": "https://localhost:8080/" }, "id": "a624b2b2-faad-4325-ac7d-53a840f564ef", - "outputId": "b9b3de92-4304-4540-dfba-a4549fa157eb" + "outputId": "74dc0b75-58b5-4c97-9965-91315e8a98a5" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:40:09 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", - "18:40:09 INFO - pipeline id pipeline_id\n", - "18:40:09 INFO - code location None\n", - "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", - "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:40:09 INFO - orchestrator ededup started at 2024-09-18 18:40:09\n", - "18:40:09 INFO - Number of files is 2, source profile {'max_file_size': 0.009340286254882812, 'min_file_size': 0.0092620849609375, 'total_file_size': 0.018602371215820312}\n", - "18:40:09 INFO - Starting from the beginning\n", - "18:40:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", - "18:40:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", - "18:40:09 INFO - Done processing 2 files, waiting for flush() completion.\n", - "18:40:09 INFO - done 
flushing in 0.0 sec\n", - "18:40:09 INFO - Completed execution in 0.0 min, execution result 0\n" + "22:43:09 INFO - exact dedup params are {'doc_column': 'contents', 'doc_id_column': 'chunk_hash', 'use_snapshot': False, 'snapshot_directory': None}\n", + "22:43:09 INFO - pipeline id pipeline_id\n", + "22:43:09 INFO - code location None\n", + "22:43:09 INFO - data factory data_ is using local data access: input_folder - output/03_docid_out output_folder - output/04_exact_dedupe_out\n", + "22:43:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:43:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:43:09 INFO - orchestrator ededup started at 2024-10-16 22:43:09\n", + "22:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.010180473327636719, 'min_file_size': 0.010101318359375, 'total_file_size': 0.02028179168701172}\n", + "22:43:09 INFO - Starting from the beginning\n", + "22:43:09 INFO - Completed 1 files (50.0%) in 0.0 min\n", + "22:43:09 INFO - Completed 2 files (100.0%) in 0.0 min\n", + "22:43:09 INFO - Done processing 2 files, waiting for flush() completion.\n", + "22:43:09 INFO - done flushing in 0.0 sec\n", + "22:43:09 INFO - Completed execution in 0.0 min, execution result 0\n" ] }, { @@ -2170,8 +2201,8 @@ "output_type": "stream", "text": [ "✅ Stage:4 completed successfully\n", - "CPU times: user 15.4 ms, sys: 478 μs, total: 15.9 ms\n", - "Wall time: 12.9 ms\n" + "CPU times: user 12.6 ms, sys: 5.26 ms, total: 17.9 ms\n", + "Wall time: 14.6 ms\n" ] } ], @@ -2226,18 +2257,18 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 358 + "height": 815 }, "id": "d824ebf6", - "outputId": "14aa660f-6f1a-4f93-9b61-5f8f8adcf3fe" + "outputId": "68f55770-c750-4607-a205-ba183603019d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= 
(8, 17)\n", - "Output data dimensions (rows x columns)= (7, 18)\n", + "Input data dimensions (rows x columns)= (8, 18)\n", + "Output data dimensions (rows x columns)= (7, 19)\n", "Input chunks before exact dedupe : 8\n", "Output chunks after exact dedupe : 7\n", "Duplicate chunks removed : 1\n" @@ -2268,17 +2299,18 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", " chunk_hash\n", " chunk_id\n", " removed\n", @@ -2291,18 +2323,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Solar System\\nFor more details about the Solar...\n", " $.main-text[3]\n", " 1\n", " [133.18510437, 570.83258057, 374.99838257, 581...\n", " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", " 5\n", " [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...\n", " \n", @@ -2312,18 +2345,19 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", " 6\n", " []\n", " \n", @@ -2333,18 +2367,19 @@ " 1\n", " 0\n", " 11\n", - " 
25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", " 7\n", " []\n", " \n", @@ -2354,18 +2389,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 0\n", " []\n", " \n", @@ -2375,18 +2411,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " 1\n", " []\n", " \n", @@ -2396,18 +2433,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", 
- " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", " 2\n", " []\n", " \n", @@ -2417,18 +2455,19 @@ " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", " 3\n", " []\n", " \n", @@ -2437,23 +2476,14 @@ "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 mars.pdf 1 0 11 \n", - "3 earth.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "6 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "2 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "6 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", "\n", " hash 
size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", @@ -2465,13 +2495,22 @@ "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "2 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "6 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "3 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "3 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", "\n", " contents doc_jsonpath \\\n", "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", @@ -2491,6 +2530,15 @@ "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", "\n", + " document_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 
\n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", " chunk_hash chunk_id \\\n", "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", @@ -2536,10 +2584,10 @@ "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 112 + "height": 269 }, "id": "82cc9bb0", - "outputId": "2aff0a5f-8cc7-408c-e1cf-62c0b14b18fb" + "outputId": "46d9e91d-c470-4e3e-e5c8-508c534dbceb" }, "outputs": [ { @@ -2636,7 +2684,7 @@ "base_uri": "https://localhost:8080/" }, "id": "cc61dffa", - "outputId": "337b015f-3795-4c45-98a3-03ae817d4dca" + "outputId": "7fb26043-8538-48b6-80b7-16ceb818c1a8" }, "outputs": [ { @@ -2718,223 +2766,175 @@ "source": [ " ## Step-7: Fuzzy Dedup\n", "\n", - "Post exact deduplication, fuzzy deduplication is applied with the goal of removing **very similar** chunks\n", + "And fuzzy dedupe is only available in RAY version. So we will skip it here\n", + "\n", + "See this file [dpk_intro_1_ray.ipynb](dpk_intro_1_ray.ipynb)" + ] + }, + { + "cell_type": "markdown", + "id": "5370950a-2a3a-4143-8218-f9b4808099ba", + "metadata": { + "id": "5370950a-2a3a-4143-8218-f9b4808099ba" + }, + "source": [ + "## Step-8: Text encoding\n", "\n", - "And fuzzy dedupe is only available in RAY version." + "Encode text for the vector storage." 
] }, { "cell_type": "markdown", - "id": "fcf574a3-b287-419c-9c86-07b828b41ca6", + "id": "85aba685", "metadata": { - "id": "fcf574a3-b287-419c-9c86-07b828b41ca6" + "id": "85aba685" }, "source": [ - "### 7.1 - Set Input/output Folder" + "### 8.1 - Set Input/output Folder" ] }, { "cell_type": "code", "execution_count": 26, - "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, - "id": "9e431c8c-c7c7-48de-ba5f-2c4649c35399", - "outputId": "4450ed63-3b09-42e4-8085-2951e700cf8f" + "id": "20a153fa-fd56-401e-86be-4f7617affcc8", + "outputId": "41d268f5-7cc6-432e-d56e-2ba882fbdba6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "🏃🏼 STAGE-5: Processing input='output/04_exact_dedupe_out' --> output='output/05_fuzzy_dedupe_out'\n" + "🏃🏼 STAGE-6: Processing input='output/04_exact_dedupe_out' --> output='output/05_embeddings_out'\n" ] } ], "source": [ - "## Input to this component is the output of doc_id generator component.\n", - "\n", - "STAGE = 5\n", + "STAGE = 6\n", "\n", "input_folder = output_exact_dedupe_dir # previous output folder is the input folder for the current stage\n", - "output_folder = output_fuzzy_dedupe_dir\n", + "output_folder = output_embeddings_dir\n", + "\n", "input_df = read_parquet_files_as_df(input_folder) ## for debug purposes\n", + "\n", "print (f\"🏃🏼 STAGE-{STAGE}: Processing input='{input_folder}' --> output='{output_folder}'\")" ] }, { "cell_type": "markdown", - "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3", + "id": "c97545f4", "metadata": { - "id": "f4c82a8f-b513-4fe5-b172-d41b104b54f3" + "id": "c97545f4" }, "source": [ - "### 7.2 - Execute" + "### 8.2 - Execute" ] }, { "cell_type": "code", "execution_count": 27, - "id": "3864ff77-e9a8-48f7-973b-c3b3aef1a94f", + "id": "228df6b2-bc62-494b-9697-03ece98d7853", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, - "id": 
"3864ff77-e9a8-48f7-973b-c3b3aef1a94f", - "outputId": "2baa790d-6944-4d20-f0c1-fc2979eb1686" + "id": "228df6b2-bc62-494b-9697-03ece98d7853", + "outputId": "b2119b07-0654-45cd-f729-1396e18b24b1" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "18:40:09 INFO - Running locally\n", - "18:40:09 INFO - fuzzy dedup params are {'doc_column': 'contents', 'id_column': 'chunk_id', 'cluster_column': 'chunk_hash', 'bucket_cpu': 0.3, 'mhash_cpu': 0.3, 'doc_cpu': 0.3, 'num_doc_actors': 1, 'num_minhash_actors': 1, 'num_bucket_actors': 1, 'num_preprocessors': 1, 'num_permutations': 64, 'threshold': 0.7, 'shingles_size': 5, 'delimiters': ' ', 'snapshot_delay': 1, 'use_bucket_snapshot': False, 'use_doc_snapshot': False, 'random_delay_limit': 10, 'worker_options': {'num_cpus': 0.8}}\n", - "18:40:09 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_fuzzy_dedupe_out\n", - "18:40:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "18:40:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "18:40:09 INFO - pipeline id pipeline_id\n", - "18:40:09 INFO - code location None\n", - "18:40:09 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n", - "18:40:09 INFO - actor creation delay 0\n", - "18:40:09 INFO - job details {'job category': 'preprocessing', 'job name': 'fdedup', 'job type': 'ray', 'job id': 'job_id'}\n", - "2024-09-18 18:40:11,503\tINFO worker.py:1744 -- Started a local Ray instance. 
View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - orchestrator started at 2024-09-18 18:40:12\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Number of files is 2, source profile {'max_file_size': 0.009611129760742188, 'min_file_size': 0.009521484375, 'total_file_size': 0.019132614135742188}\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Cluster resources: {'cpus': 16, 'gpus': 1, 'memory': 8.208082581870258, 'object_store': 4.104041289538145}\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - starting run from the beginning\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - continuing from the very beginning\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Fuzzy: num buckets 8, bucket length 8\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 bucket actors\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 minhash actors\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - Table preprocessing uses 1 readers\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:12 INFO - created 1 table processor actors\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:13 INFO - Completed 1 files in 0.014 min\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:13 INFO - Completed 1 files (50.0%) in 0.014 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:15 INFO - Completed processing 2 files in 0.047 min\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:15 INFO - creating minhash snapshots\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:16 INFO - minhash snapshots created\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:16 INFO - creating bucket snapshots\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - bucket snapshots created\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created 1 document actors\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created 1 bucket processor actors\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - created bucket processor invoker\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:17 INFO - added invoker to bucket collectors\n", - "\u001b[36m(BucketsHash pid=1191796)\u001b[0m 18:40:17 INFO - processing buckets 0 long, 53 short\n", - "\u001b[36m(BucketsHash pid=1191796)\u001b[0m 18:40:17 INFO - Done submitting long buckets\n", - "\u001b[36m(BucketsHashProcessorInvoker pid=1192188)\u001b[0m 18:40:18 INFO - Waiting bucket processing completion. Submitted requests 1\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:18 INFO - Done processing buckets in 0.011 min\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:18 INFO - creating document snapshots\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:19 INFO - document snapshots created\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:19 INFO - Completed 0 files (0.0%) in 0.0 min. 
Waiting for completion\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:27 INFO - Completed processing 2 files in 0.131 min\n", - "\u001b[36m(orchestrate pid=1190951)\u001b[0m 18:40:27 INFO - done flushing in 0.004 sec\n", - "18:40:37 INFO - Completed execution in 0.462 min, execution result 0\n" + "22:43:10 INFO - text_encoder parameters are : {'content_column_name': 'contents', 'output_embeddings_column_name': 'embeddings', 'model_name': 'sentence-transformers/all-MiniLM-L6-v2'}\n", + "22:43:10 INFO - pipeline id pipeline_id\n", + "22:43:10 INFO - code location None\n", + "22:43:10 INFO - data factory data_ is using local data access: input_folder - output/04_exact_dedupe_out output_folder - output/05_embeddings_out\n", + "22:43:10 INFO - data factory data_ max_files -1, n_sample -1\n", + "22:43:10 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "22:43:10 INFO - orchestrator text_encoder started at 2024-10-16 22:43:10\n", + "22:43:10 INFO - Number of files is 2, source profile {'max_file_size': 0.010450363159179688, 'min_file_size': 0.010318756103515625, 'total_file_size': 0.020769119262695312}\n", + "22:43:12 INFO - Completed 1 files (50.0%) in 0.004 min\n", + "22:43:12 INFO - Completed 2 files (100.0%) in 0.004 min\n", + "22:43:12 INFO - Done processing 2 files, waiting for flush() completion.\n", + "22:43:12 INFO - done flushing in 0.0 sec\n", + "22:43:12 INFO - Completed execution in 0.039 min, execution result 0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "✅ Stage:5 completed successfully\n", - "CPU times: user 457 ms, sys: 296 ms, total: 753 ms\n", - "Wall time: 29.2 s\n" + "✅ Stage:6 completed successfully\n", + "CPU times: user 671 ms, sys: 230 ms, total: 901 ms\n", + "Wall time: 2.8 s\n" ] } ], "source": [ "%%time\n", "\n", - "import os\n", - "import sys\n", - "\n", - "from data_processing.utils import 
ParamsUtils\n", - "from fdedup_transform_ray import FdedupRayTransformConfiguration\n", - "from data_processing_ray.runtime.ray import RayTransformLauncher\n", - "\n", - "# create parameters\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from text_encoder_local_python import TextEncoderPythonTransformConfiguration\n", "\n", "local_conf = {\n", " \"input_folder\": input_folder,\n", " \"output_folder\": output_folder,\n", "}\n", - "worker_options = {\"num_cpus\" : MY_CONFIG.RAY_NUM_CPUS}\n", - "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", "params = {\n", - " # where to run\n", - " \"run_locally\": True,\n", " # Data access. Only required parameters are specified\n", " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " # Orchestration parameters\n", - " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n", - " \"runtime_num_workers\": MY_CONFIG.RAY_RUNTIME_WORKERS,\n", - " # columns used\n", - " \"fdedup_doc_column\": \"contents\",\n", - " \"fdedup_id_column\": \"chunk_id\",\n", - " \"fdedup_cluster_column\": \"chunk_hash\",\n", - " # infrastructure\n", - " \"fdedup_bucket_cpu\": 0.3,\n", - " \"fdedup_doc_cpu\": 0.3,\n", - " \"fdedup_mhash_cpu\": 0.3,\n", - " \"fdedup_num_doc_actors\": 1,\n", - " \"fdedup_num_bucket_actors\": 1,\n", - " \"fdedup_num_minhash_actors\": 1,\n", - " \"fdedup_num_preprocessors\": 1,\n", - " # fuzzy parameters\n", - " \"fdedup_num_permutations\": 64,\n", - " \"fdedup_threshold\": 0.7, # (default 0.8)\n", - " \"fdedup_shingles_size\": 5,\n", - " \"fdedup_delimiters\": \" \"\n", + " # text_encoder\n", + " \"text_encoder_model_name\": MY_CONFIG.EMBEDDING_MODEL,\n", "}\n", "\n", - "# Pass commandline params\n", "sys.argv = ParamsUtils.dict_to_req(d=params)\n", - "\n", - "# launch\n", - "\n", - "launcher = RayTransformLauncher(FdedupRayTransformConfiguration())\n", + "# create launcher\n", + "launcher = 
PythonTransformLauncher(TextEncoderPythonTransformConfiguration())\n", "\n", "return_code = launcher.launch()\n", "\n", "if return_code == 0:\n", " print (f\"✅ Stage:{STAGE} completed successfully\")\n", "else:\n", - " raise Exception (\"❌ Ray job failed\")" + " raise Exception (\"❌ Job failed\")" ] }, { "cell_type": "markdown", - "id": "a6f8cd11", + "id": "b734852c", "metadata": { - "id": "a6f8cd11" + "id": "b734852c" }, "source": [ - "### 7.3 - Inspect Generated output" + "### 8.3 - Inspect Generated output\n", + "\n", + "You will see a column called `embeddings` added at the end. This is the text content converted into vectors, or embeddings. We used the model `sentence-transformers/all-MiniLM-L6-v2`." ] }, { "cell_type": "code", "execution_count": 28, - "id": "e899ad60", + "id": "7b1c1d09", "metadata": { "colab": { "base_uri": "https://localhost:8080/", - "height": 222 + "height": 760 }, - "id": "e899ad60", - "outputId": "17aaaea8-a106-4c9a-ceb3-6760d92f8b59" + "id": "7b1c1d09", + "outputId": "018daa18-e5db-4483-d8d5-30aded80d5e3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Input data dimensions (rows x columns)= (7, 18)\n", - "Output data dimensions (rows x columns)= (6, 18)\n", - "Duplicate chunks removed by fuzzy-dedupe: 1\n" + "Input data dimensions (rows x columns)= (7, 19)\n", + "Output data dimensions (rows x columns)= (7, 20)\n" ] }, { @@ -2962,20 +2962,22 @@ " num_pages\n", " num_tables\n", " num_doc_elements\n", - " document_id\n", " ext\n", " hash\n", " size\n", " date_acquired\n", " pdf_convert_time\n", " source_filename\n", + " source_document_id\n", " contents\n", " doc_jsonpath\n", " page_number\n", " bbox\n", + " document_id\n", + " chunk_hash\n", " chunk_id\n", " removed\n", - " chunk_hash\n", + " embeddings\n", " \n", " \n", " \n", @@ -2985,186 +2987,255 @@ " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 
2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", + " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", + " Solar System\\nFor more details about the Solar...\n", + " $.main-text[3]\n", + " 1\n", + " [133.18510437, 570.83258057, 374.99838257, 581...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07...\n", + " 5\n", + " [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567...\n", + " [-0.051861435, 0.0035226212, 0.030617002, 0.04...\n", + " \n", + " \n", + " 1\n", + " mars.pdf\n", + " 1\n", + " 0\n", + " 11\n", + " pdf\n", + " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", + " 2800\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Mars\\nMars, the fourth planet from the Sun, is...\n", " $.main-text[5]\n", " 1\n", " [132.87440491, 500.84011841, 477.48345947, 534...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", + " a31663e06fac41470ecc459f5a58658a3f9997d7801053...\n", " 6\n", " []\n", - " -1\n", + " [0.07728295, 0.024970993, -0.043180738, 0.0580...\n", " \n", " \n", - " 1\n", + " 2\n", " mars.pdf\n", " 1\n", " 0\n", " 11\n", - " 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1\n", " pdf\n", " 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7...\n", " 2800\n", - " 2024-09-18T18:40:07.682106\n", - " 0.838944\n", + " 2024-10-16T22:43:08.048035\n", + " 0.827872\n", " mars.pdf\n", + " 07bc0c9a-f863-48e3-9aed-bd289af040bc\n", " Basic facts about Mars:\\n· Distance from the S...\n", " $.main-text[6]\n", " 1\n", " [133.2026062, 482.90710449, 237.04431152, 493....\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", + " 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a...\n", " 7\n", " []\n", - " -1\n", + " [0.10598018, 0.025460618, 0.023627337, 0.03905...\n", " \n", " \n", - " 2\n", + " 3\n", " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " 
pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nOur solar system is a vast and f...\n", " $.main-text[2]\n", " 1\n", " [132.87112427, 588.96014404, 479.40917969, 623...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", + " 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674...\n", " 0\n", " []\n", - " -1\n", + " [0.0077404436, -0.02055944, 0.026426593, 0.011...\n", " \n", " \n", - " 3\n", + " 4\n", " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Solar System\\nFor more details about our Solar...\n", " $.main-text[3]\n", " 1\n", " [133.20942688, 570.81555176, 375.57919312, 581...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", + " d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d...\n", " 1\n", " []\n", - " 5\n", + " [-0.062105548, -0.0053322907, 0.031277698, 0.0...\n", " \n", " \n", - " 4\n", + " 5\n", " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nEarth is the third planet from the Sun....\n", " $.main-text[5]\n", " 1\n", " [132.91053772, 512.46295166, 477.84887695, 534...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", + " 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3...\n", " 2\n", " []\n", - " -1\n", + " [0.072435796, -0.058001805, -0.019771898, -0.0...\n", " 
\n", " \n", - " 5\n", + " 6\n", " earth.pdf\n", " 1\n", " 0\n", " 11\n", - " e1053a34-3cc1-45c1-abe7-204a240152c0\n", " pdf\n", " 18713f970989055625bef22209b6f4b6830b9ca22046bf...\n", " 2686\n", - " 2024-09-18T18:40:06.831334\n", - " 0.857239\n", + " 2024-10-16T22:43:07.205350\n", + " 0.921915\n", " earth.pdf\n", + " e141f7a4-3e45-4f04-88d3-60e0a81b195b\n", " Earth\\nBasic facts about Earth:\\n· Distance fr...\n", " $.main-text[6]\n", " 1\n", " [133.30151367, 494.86206055, 240.17156982, 505...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", + " 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f...\n", " 3\n", " []\n", - " -1\n", + " [0.091821924, 0.015197902, 0.07716932, 0.01711...\n", " \n", " \n", "\n", "" ], "text/plain": [ - " filename num_pages num_tables num_doc_elements \\\n", - "0 mars.pdf 1 0 11 \n", - "1 mars.pdf 1 0 11 \n", - "2 earth.pdf 1 0 11 \n", - "3 earth.pdf 1 0 11 \n", - "4 earth.pdf 1 0 11 \n", - "5 earth.pdf 1 0 11 \n", - "\n", - " document_id ext \\\n", - "0 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "1 25064eb4-470e-4d7e-b2f5-84d59cbbe6f1 pdf \n", - "2 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "3 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "4 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", - "5 e1053a34-3cc1-45c1-abe7-204a240152c0 pdf \n", + " filename num_pages num_tables num_doc_elements ext \\\n", + "0 mars.pdf 1 0 11 pdf \n", + "1 mars.pdf 1 0 11 pdf \n", + "2 mars.pdf 1 0 11 pdf \n", + "3 earth.pdf 1 0 11 pdf \n", + "4 earth.pdf 1 0 11 pdf \n", + "5 earth.pdf 1 0 11 pdf \n", + "6 earth.pdf 1 0 11 pdf \n", "\n", " hash size \\\n", "0 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "1 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", - "2 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "2 8edd5dfbf888777120b528a5d8998f2757d006df0eaef7... 2800 \n", "3 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "4 18713f970989055625bef22209b6f4b6830b9ca22046bf... 
2686 \n", "5 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", + "6 18713f970989055625bef22209b6f4b6830b9ca22046bf... 2686 \n", "\n", " date_acquired pdf_convert_time source_filename \\\n", - "0 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "1 2024-09-18T18:40:07.682106 0.838944 mars.pdf \n", - "2 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "3 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "4 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", - "5 2024-09-18T18:40:06.831334 0.857239 earth.pdf \n", + "0 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "1 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "2 2024-10-16T22:43:08.048035 0.827872 mars.pdf \n", + "3 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "4 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "5 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "6 2024-10-16T22:43:07.205350 0.921915 earth.pdf \n", + "\n", + " source_document_id \\\n", + "0 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "1 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "2 07bc0c9a-f863-48e3-9aed-bd289af040bc \n", + "3 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "4 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "5 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", + "6 e141f7a4-3e45-4f04-88d3-60e0a81b195b \n", "\n", " contents doc_jsonpath \\\n", - "0 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", - "1 Basic facts about Mars:\\n· Distance from the S... $.main-text[6] \n", - "2 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", - "3 Solar System\\nFor more details about our Solar... $.main-text[3] \n", - "4 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", - "5 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "0 Solar System\\nFor more details about the Solar... $.main-text[3] \n", + "1 Mars\\nMars, the fourth planet from the Sun, is... $.main-text[5] \n", + "2 Basic facts about Mars:\\n· Distance from the S... 
$.main-text[6] \n", + "3 Solar System\\nOur solar system is a vast and f... $.main-text[2] \n", + "4 Solar System\\nFor more details about our Solar... $.main-text[3] \n", + "5 Earth\\nEarth is the third planet from the Sun.... $.main-text[5] \n", + "6 Earth\\nBasic facts about Earth:\\n· Distance fr... $.main-text[6] \n", + "\n", + " page_number bbox \\\n", + "0 1 [133.18510437, 570.83258057, 374.99838257, 581... \n", + "1 1 [132.87440491, 500.84011841, 477.48345947, 534... \n", + "2 1 [133.2026062, 482.90710449, 237.04431152, 493.... \n", + "3 1 [132.87112427, 588.96014404, 479.40917969, 623... \n", + "4 1 [133.20942688, 570.81555176, 375.57919312, 581... \n", + "5 1 [132.91053772, 512.46295166, 477.84887695, 534... \n", + "6 1 [133.30151367, 494.86206055, 240.17156982, 505... \n", + "\n", + " document_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... \n", + "\n", + " chunk_hash chunk_id \\\n", + "0 dee4c03474c98efdabbadbcc4ce91138c7820f4ac8ff07... 5 \n", + "1 a31663e06fac41470ecc459f5a58658a3f9997d7801053... 6 \n", + "2 7ff317954ec5f3b15607c053c30c2b0db0f6b64cc3295a... 7 \n", + "3 44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf5674... 0 \n", + "4 d7be13d7dee96cf2384072d0eb01981e0e75eec2e7bc6d... 1 \n", + "5 7c4a750e2215f231803a6f8078bde1e9699034fb033dd3... 2 \n", + "6 189a221704d17feeb96b1b1ef60a2a2445459848cd8e8f... 3 \n", "\n", - " page_number bbox chunk_id \\\n", - "0 1 [132.87440491, 500.84011841, 477.48345947, 534... 6 \n", - "1 1 [133.2026062, 482.90710449, 237.04431152, 493.... 7 \n", - "2 1 [132.87112427, 588.96014404, 479.40917969, 623... 
0 \n", - "3 1 [133.20942688, 570.81555176, 375.57919312, 581... 1 \n", - "4 1 [132.91053772, 512.46295166, 477.84887695, 534... 2 \n", - "5 1 [133.30151367, 494.86206055, 240.17156982, 505... 3 \n", + " removed \\\n", + "0 [44c6e373258c7cdc03f75a8e96a9b160f9aa4e4baf567... \n", + "1 [] \n", + "2 [] \n", + "3 [] \n", + "4 [] \n", + "5 [] \n", + "6 [] \n", "\n", - " removed chunk_hash \n", - "0 [] -1 \n", - "1 [] -1 \n", - "2 [] -1 \n", - "3 [] 5 \n", - "4 [] -1 \n", - "5 [] -1 " + " embeddings \n", + "0 [-0.051861435, 0.0035226212, 0.030617002, 0.04... \n", + "1 [0.07728295, 0.024970993, -0.043180738, 0.0580... \n", + "2 [0.10598018, 0.025460618, 0.023627337, 0.03905... \n", + "3 [0.0077404436, -0.02055944, 0.026426593, 0.011... \n", + "4 [-0.062105548, -0.0053322907, 0.031277698, 0.0... \n", + "5 [0.072435796, -0.058001805, -0.019771898, -0.0... \n", + "6 [0.091821924, 0.015197902, 0.07716932, 0.01711... " ] }, "execution_count": 28, @@ -3179,645 +3250,37 @@ "\n", "print (\"Input data dimensions (rows x columns)= \", input_df.shape)\n", "print (\"Output data dimensions (rows x columns)= \", output_df.shape)\n", - "print (\"Duplicate chunks removed by fuzzy-dedupe: \", (input_df.shape[0] - output_df.shape[0]))\n", "\n", "output_df.head(10)" ] }, + { + "cell_type": "markdown", + "id": "f5e12630-be6b-4188-a925-77117155617b", + "metadata": { + "id": "f5e12630-be6b-4188-a925-77117155617b" + }, + "source": [ + "## Step-9: Copy output to final output dir" + ] + }, { "cell_type": "code", "execution_count": 29, - "id": "ab7ea52b", + "id": "16dee3b8-31dc-4168-8adb-f2a0a0b5e207", "metadata": { "colab": { - "base_uri": "https://localhost:8080/", - "height": 81 + "base_uri": "https://localhost:8080/" }, - "id": "ab7ea52b", - "outputId": "8e57385f-c925-4ac7-9e0d-ebc64e92530a" - }, - "outputs": [ - { - "data": { - "text/html": [ - "