Commit

First draft of fdedup notebook
Signed-off-by: Constantin M Adam <[email protected]>
cmadam committed Nov 25, 2024
1 parent 280d105 commit 1a762e0
Showing 1 changed file with 152 additions and 0 deletions.
152 changes: 152 additions & 0 deletions transforms/universal/fdedup/fdedup.ipynb
@@ -0,0 +1,152 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "afd55886-5f5b-4794-838e-ef8179fb0394",
"metadata": {},
"source": [
"##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab could be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:\n",
"```\n",
"make venv\n",
"source venv/bin/activate && pip install jupyterlab\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"## This is here as a reference only\n",
"# Users and application developers must use the right tag for the latest from pypi\n",
"#!pip install data-prep-toolkit\n",
"#!pip install data-prep-toolkit-transforms\n",
"#!pip install data-prep-connector"
]
},
{
"cell_type": "markdown",
"id": "ebf1f782-0e61-485c-8670-81066beb734c",
"metadata": {},
"source": [
"##### ***** Import required classes and modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"import os\n",
"import sys\n",
"\n",
"from data_processing.utils import ParamsUtils\n",
"from fdedup_transform_python import parse_args\n",
"from fdedup_transform_ray import RayServiceOrchestrator"
]
},
{
"cell_type": "markdown",
"id": "7234563c-2924-4150-8a31-4aec98c1bf33",
"metadata": {},
"source": [
"##### ***** Setup runtime parameters for this transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e90a853e-412f-45d7-af3d-959e755aeebb",
"metadata": {},
"outputs": [],
"source": [
"# create parameters\n",
"input_folder = os.path.join(\"ray\", \"test-data\", \"input\")\n",
"output_folder = os.path.join(\"ray\", \"output\")\n",
"params = {\n",
" # transform configuration parameters\n",
" \"input_folder\": input_folder,\n",
" \"output_folder\": output_folder,\n",
" \"contents_column\": \"contents\",\n",
" \"document_id_column\": \"int_id_column\",\n",
" \"num_permutations\": 112,\n",
" \"num_bands\": 14,\n",
" \"num_minhashes_per_band\": 8,\n",
" \"num_segments\": 1,\n",
" \"operation_mode\": \"annotate\",\n",
" # ray configuration parameters\n",
" \"run_locally\": True,\n",
"}\n"
]
},
{
"cell_type": "markdown",
"id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a",
"metadata": {},
"source": [
"##### ***** Use the Ray runtime to invoke the transform"
]
},
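{
"cell_type": "code",
"execution_count": null,
"id": "0775e400-7469-49a6-8998-bd4772931459",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"# Pass the parameters defined above as command line arguments\n",
"sys.argv = ParamsUtils.dict_to_req(d=params)\n",
"args = parse_args()\n",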
"# Initialize the orchestrator\n",
"orchestrator = RayServiceOrchestrator(global_params=args)\n",
"# Launch ray fuzzy dedup execution\n",
"orchestrator.orchestrate()\n"
]
},
{
"cell_type": "markdown",
"id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
"metadata": {},
"source": [
"##### **** The specified folder will include the transformed parquet files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7276fe84-6512-4605-ab65-747351e13a7c",
"metadata": {},
"outputs": [],
"source": [
"import glob\n",
"# List the files written to the output folder configured above\n",
"glob.glob(\"ray/output/*\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.19"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
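
Note on the parameter cell: `num_permutations` (112) equals `num_bands` (14) times `num_minhashes_per_band` (8). The sketch below checks that relationship and shows how the band settings shape candidate selection. It assumes the transform follows standard MinHash LSH banding, which the notebook itself does not state. The helper names `candidate_probability` and `approximate_threshold` are hypothetical and are not part of the toolkit.

```python
# Minimal sketch, assuming standard MinHash LSH banding.
# The helper functions below are illustrative only; they are not toolkit APIs.


def candidate_probability(similarity: float, num_bands: int, rows_per_band: int) -> float:
    """Probability that two documents with the given Jaccard similarity
    agree on every minhash in at least one band."""
    return 1.0 - (1.0 - similarity**rows_per_band) ** num_bands


def approximate_threshold(num_bands: int, rows_per_band: int) -> float:
    """Common rule of thumb for the similarity at which the S-curve rises steeply."""
    return (1.0 / num_bands) ** (1.0 / rows_per_band)


# Values copied from the notebook's parameter cell
num_permutations = 112
num_bands = 14
num_minhashes_per_band = 8

# The minhash signature must split evenly into bands
assert num_bands * num_minhashes_per_band == num_permutations

print(f"approximate threshold: {approximate_threshold(num_bands, num_minhashes_per_band):.3f}")
for s in (0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, num_bands, num_minhashes_per_band)
    print(f"similarity {s:.1f} -> candidate probability {p:.3f}")
```

Under the same assumption, raising `num_bands` while keeping `num_minhashes_per_band` fixed makes the match more permissive, while raising `num_minhashes_per_band` makes it stricter.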
