updated with relative path and added markdown for notebook
Signed-off-by: Sungeun An <[email protected]>
Sungeun An committed Nov 21, 2024
1 parent 76e067d commit b018b22
Showing 2 changed files with 40 additions and 27 deletions.
57 changes: 36 additions & 21 deletions transforms/language/html2parquet/notebooks/html2parquet.ipynb
@@ -1,9 +1,17 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "8435e1f7-0c2e-49f4-a77a-b525ee6c532b",
"metadata": {},
"source": [
"# Html2Parquet Transform Sample Notebook"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c4f9c952-cb3b-40f1-bfb5-00d9a43a5715",
"execution_count": null,
"id": "d9420989-ec8a-4fde-9a93-dc25096389f1",
"metadata": {},
"outputs": [],
"source": [
@@ -26,6 +34,14 @@
"from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration\n"
]
},
{
"cell_type": "markdown",
"id": "6d85491b-0093-46e7-8653-ca8052ea59f0",
"metadata": {},
"source": [
"## Specify input/output folders and parameters"
]
},
{
"cell_type": "code",
"execution_count": 3,
@@ -37,7 +53,7 @@
"\n",
"# create parameters\n",
"local_conf = {\n",
" \"input_folder\": \"/path/to/your/input/folder\",\n",
" \"input_folder\": \"/path/to/your/input/folder\", # For the sample input files, refer to the 'python/test-data/input' folder\n",
" \"output_folder\": \"/path/to/your/output/folder\",\n",
"}\n",
"\n",
@@ -48,6 +64,14 @@
"}\n"
]
},
{
"cell_type": "markdown",
"id": "0dcd1249-1eb8-4b33-9827-626f90c840b4",
"metadata": {},
"source": [
"## Invoke the html2parquet transformation"
]
},
{
"cell_type": "code",
"execution_count": 4,
@@ -74,7 +98,6 @@
}
],
"source": [
"\n",
"import sys\n",
"sys.argv = ParamsUtils.dict_to_req(d=(params))\n",
"# create launcher\n",
@@ -83,6 +106,14 @@
"return_code = launcher.launch()\n"
]
},
{
"cell_type": "markdown",
"id": "3c66468d-703f-427f-a1dd-a758edd334de",
"metadata": {},
"source": [
"## Checking the output Parquet file"
]
},
{
"cell_type": "code",
"execution_count": 5,
@@ -178,22 +209,6 @@
"source": [
"table.to_pandas()['contents'][0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2fd0d13b-1ff6-4988-91fb-52c25ba998c8",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "587e43ee-7b51-4a9c-8bf2-0a23e309a7ae",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -212,7 +227,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
"version": "3.11.9"
}
},
"nbformat": 4,
10 changes: 4 additions & 6 deletions transforms/language/html2parquet/python/README.md
@@ -18,18 +15,15 @@ This transform iterates through zipped collections of HTML files or single HTML
## Date

**Last updated:** 10/16/24
- **Update details:**
- Added Trafilatura parameters (`favor_precision` and `favor_recall`) for enhanced control over content extraction.
- Enhanced table and image extraction features.
- See [Pull Request #707](https://github.com/IBM/data-prep-kit/pull/707) for more details.
**Update details:** Enhanced table and image extraction features by adding the corresponding Trafilatura parameters.

---

## Input and Output

### Input
- Accepted Formats: Single HTML files or zipped collections of HTML files.
- Sample Input Files: [sample html files](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/python/test-data/input)
- Sample Input Files: [sample html files](test-data/input)

### Output
- Format: Parquet files with the following structure:
@@ -205,7 +202,8 @@ python ../html2parquet/python/src/html2parquet_transform_python.py \

### Sample Notebook

See the [sample notebook](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/notebooks/html2parquet.ipynb) for an example.
See the [sample notebook](../notebooks/html2parquet.ipynb) for an example.


## Further Resources

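Note on the relative links in this commit: the README changes replace absolute GitHub URLs with paths relative to `transforms/language/html2parquet/python/README.md`. The sketch below checks where those links point, assuming they are resolved against the directory containing the README. `resolve_relative_link` is a hypothetical helper written for illustration; it is not part of data-prep-kit.

```python
import posixpath


def resolve_relative_link(source_file: str, link: str) -> str:
    """Resolve a relative link against the directory of the file that contains it.

    Hypothetical helper for illustration only. Uses POSIX-style paths,
    which is how repository paths are written.
    """
    base_dir = posixpath.dirname(source_file)
    # normpath collapses ".." segments so the result is a clean repository path
    return posixpath.normpath(posixpath.join(base_dir, link))


# Path of the README changed in this commit
readme = "transforms/language/html2parquet/python/README.md"

# The two relative links introduced by the README changes
notebook_target = resolve_relative_link(readme, "../notebooks/html2parquet.ipynb")
input_target = resolve_relative_link(readme, "test-data/input")

print(notebook_target)
# transforms/language/html2parquet/notebooks/html2parquet.ipynb
print(input_target)
# transforms/language/html2parquet/python/test-data/input
```

Both results match the repository paths used in the absolute URLs that were removed, so the relative links should land on the same notebook and sample input folder.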