Merge pull request #815 from IBM/html2parquet
Html2Parquet Updated README and Added Sample Notebook
235 changes: 235 additions & 0 deletions
235
transforms/language/html2parquet/notebooks/html2parquet.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,235 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8435e1f7-0c2e-49f4-a77a-b525ee6c532b", | ||
"metadata": {}, | ||
"source": [ | ||
"# Html2Parquet Transform Sample Notebook" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "d9420989-ec8a-4fde-9a93-dc25096389f1", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%capture\n", | ||
"!pip install data-prep-toolkit==0.2.2.dev2\n", | ||
"!pip install 'data-prep-toolkit-transforms[html2parquet]==0.2.2.dev2'\n", | ||
"!pip install pandas" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "20663a67-5aa1-4b61-b989-94201613e41f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from data_processing.runtime.pure_python import PythonTransformLauncher\n", | ||
"from data_processing.utils import ParamsUtils\n", | ||
"\n", | ||
"from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6d85491b-0093-46e7-8653-ca8052ea59f0", | ||
"metadata": {}, | ||
"source": [ | ||
"## Specify input/output folders and parameters" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "e75f6922-eb0f-4164-a536-f96393e04604", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import ast\n", | ||
"\n", | ||
"# create parameters\n", | ||
"local_conf = {\n", | ||
" \"input_folder\": \"/path/to/your/input/folder\", # For the sample input files, refer to the 'python/test-data/input' folder\n", | ||
" \"output_folder\": \"/path/to/your/output/folder\",\n", | ||
"}\n", | ||
"\n", | ||
"params = {\n", | ||
" # Data access. Only required parameters are specified\n", | ||
" \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", | ||
" \"data_files_to_use\": ast.literal_eval(\"['.zip', '.html']\"),\n", | ||
"}\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "0dcd1249-1eb8-4b33-9827-626f90c840b4", | ||
"metadata": {}, | ||
"source": [ | ||
"## Invoke the html2parquet transformation" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "4d2354db-1bb3-4a71-98df-f0f148af3a02", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"17:09:40 INFO - html2parquet parameters are : {'output_format': <html2parquet_output_format.MARKDOWN: 'markdown'>, 'favor_precision': <html2parquet_favor_precision.TRUE: 'True'>, 'favor_recall': <html2parquet_favor_recall.TRUE: 'True'>}\n", | ||
"17:09:40 INFO - pipeline id pipeline_id\n", | ||
"17:09:40 INFO - code location None\n", | ||
"17:09:40 INFO - data factory data_ is using local data access: input_folder - input output_folder - output\n", | ||
"17:09:40 INFO - data factory data_ max_files -1, n_sample -1\n", | ||
"17:09:40 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.html'], files to checkpoint ['.parquet']\n", | ||
"17:09:40 INFO - orchestrator html2parquet started at 2024-11-13 17:09:40\n", | ||
"17:09:40 INFO - Number of files is 1, source profile {'max_file_size': 0.2035503387451172, 'min_file_size': 0.2035503387451172, 'total_file_size': 0.2035503387451172}\n", | ||
"17:09:47 INFO - Completed 1 files (100.0%) in 0.111 min\n", | ||
"17:09:47 INFO - Done processing 1 files, waiting for flush() completion.\n", | ||
"17:09:47 INFO - done flushing in 0.0 sec\n", | ||
"17:09:47 INFO - Completed execution in 0.111 min, execution result 0\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import sys\n", | ||
"sys.argv = ParamsUtils.dict_to_req(d=(params))\n", | ||
"# create launcher\n", | ||
"launcher = PythonTransformLauncher(Html2ParquetPythonTransformConfiguration())\n", | ||
"# launch\n", | ||
"return_code = launcher.launch()\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "3c66468d-703f-427f-a1dd-a758edd334de", | ||
"metadata": {}, | ||
"source": [ | ||
"## Checking the output Parquet file" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "e2bee8da-c566-4e45-bca1-354dfd04b0df", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<div>\n", | ||
"<style scoped>\n", | ||
" .dataframe tbody tr th:only-of-type {\n", | ||
" vertical-align: middle;\n", | ||
" }\n", | ||
"\n", | ||
" .dataframe tbody tr th {\n", | ||
" vertical-align: top;\n", | ||
" }\n", | ||
"\n", | ||
" .dataframe thead th {\n", | ||
" text-align: right;\n", | ||
" }\n", | ||
"</style>\n", | ||
"<table border=\"1\" class=\"dataframe\">\n", | ||
" <thead>\n", | ||
" <tr style=\"text-align: right;\">\n", | ||
" <th></th>\n", | ||
" <th>title</th>\n", | ||
" <th>document</th>\n", | ||
" <th>contents</th>\n", | ||
" <th>document_id</th>\n", | ||
" <th>size</th>\n", | ||
" <th>date_acquired</th>\n", | ||
" </tr>\n", | ||
" </thead>\n", | ||
" <tbody>\n", | ||
" <tr>\n", | ||
" <th>0</th>\n", | ||
" <td>ai-alliance-index.html</td>\n", | ||
" <td>ai-alliance-index.html</td>\n", | ||
" <td>![](https://images.prismic.io/ai-alliance/Ztf3...</td>\n", | ||
" <td>f86b8cebe07ec9f43a351bb4dc897f162f5a88cbb0f121...</td>\n", | ||
" <td>394</td>\n", | ||
" <td>2024-11-13T17:09:40.947095</td>\n", | ||
" </tr>\n", | ||
" </tbody>\n", | ||
"</table>\n", | ||
"</div>" | ||
], | ||
"text/plain": [ | ||
" title document \\\n", | ||
"0 ai-alliance-index.html ai-alliance-index.html \n", | ||
"\n", | ||
" contents \\\n", | ||
"0 ![](https://images.prismic.io/ai-alliance/Ztf3... \n", | ||
"\n", | ||
" document_id size \\\n", | ||
"0 f86b8cebe07ec9f43a351bb4dc897f162f5a88cbb0f121... 394 \n", | ||
"\n", | ||
" date_acquired \n", | ||
"0 2024-11-13T17:09:40.947095 " | ||
] | ||
}, | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"import pyarrow.parquet as pq\n", | ||
"import pandas as pd\n", | ||
"table = pq.read_table('/path/to/your/output/folder/sample.parquet')\n", | ||
"table.to_pandas()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "cde6e37d-c437-490f-8e01-f4f51a123484", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"'![](https://images.prismic.io/ai-alliance/Ztf3gLzzk9ZrW8v8_caliopensourceslide.jpg?auto=format%2Ccompress&fit=max&w=3840)\\n\\n## Open Source AI Demo Night\\n\\nThe AI Alliance, in collaboration with Cerebral Valley and Ollama, hosted Open Source AI Demo Night in San Francisco, bringing together more than 200+ developers and innovators to showcase and celebrate the latest advances in open-source AI.'" | ||
] | ||
}, | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"table.to_pandas()['contents'][0]" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.9" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,25 +1,52 @@ | ||
# html2parquet Transform | ||
# HTML to Parquet Transform | ||
|
||
This transform iterates through a zip of HTML files or single HTML files and generates Parquet files containing the converted document as a string. | ||
--- | ||
|
||
The HTML conversion is using the [Trafilatura](https://trafilatura.readthedocs.io/en/latest/usage-python.html). | ||
## Description | ||
|
||
## Output format | ||
This transform iterates through zipped collections of HTML files or single HTML files and generates Parquet files containing the extracted content, leveraging the [Trafilatura library](https://trafilatura.readthedocs.io/en/latest/usage-python.html) for extraction of text, tables, images, and other components. | ||
|
||
The output will contain the following columns: | ||
--- | ||
|
||
## Contributors | ||
|
||
- Sungeun An ([email protected]) | ||
- Syed Zawad ([email protected]) | ||
|
||
--- | ||
|
||
## Date | ||
|
||
**Last updated:** 10/16/24 | ||
**Update details:** Enhanced table and image extraction features by adding the corresponding Trafilatura parameters. | ||
|
||
--- | ||
|
||
## Input and Output | ||
|
||
### Input | ||
- Accepted Formats: Single HTML files or zipped collections of HTML files. | ||
- Sample Input Files: [sample html files](test-data/input) | ||
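
As a rough illustration of the two accepted input shapes, the sketch below collects HTML payloads from either a single `.html` file or a `.zip` archive, using only the Python standard library. The helper names `iter_html_payloads` and `make_demo_zip` are hypothetical and not part of the transform's code; the transform's actual file handling may differ.

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_html_payloads(name: str, data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (member_name, raw_html) pairs for one input file.

    Hypothetical helper for illustration only.
    """
    lower = name.lower()
    if lower.endswith(".zip"):
        # Open the archive from memory and keep only its .html members.
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                if member.lower().endswith(".html"):
                    yield member, archive.read(member)
    elif lower.endswith(".html"):
        # A single HTML file is passed through unchanged.
        yield name, data


def make_demo_zip() -> bytes:
    """Build a small in-memory archive so the sketch can be tried without files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.html", "<html><body>A</body></html>")
        archive.writestr("notes.txt", "ignored")
    return buffer.getvalue()
```

Calling `list(iter_html_payloads("bundle.zip", make_demo_zip()))` yields only the `a.html` member; files with other extensions are skipped.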
|
||
### Output | ||
- Format: Parquet files with the following structure: | ||
|
||
```jsonc | ||
{ | ||
"title": "string", // the member filename | ||
"document": "string", // the base of the source archive | ||
"contents": "string", // the content of the HTML | ||
"title": "string", // the member filename | ||
"document": "string", // the base of the source archive | ||
"contents": "string", // the content of the HTML | ||
"document_id": "string", // the document id, a hash of `contents` | ||
"size": "string", // the size of `contents` | ||
"date_acquired": "date", // the date when the transform was executing | ||
} | ||
``` | ||
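
To make the structure above concrete, here is a minimal sketch of how one such row could be assembled in plain Python. The function `build_record` is hypothetical. The use of SHA-256 for `document_id` and of the UTF-8 byte length for `size` are assumptions made for illustration; the README only says that `document_id` is a hash of `contents` and that `size` is the size of `contents`.

```python
import hashlib
from datetime import datetime


def build_record(title: str, document: str, contents: str) -> dict:
    """Assemble one output row (hypothetical helper, not the transform's code)."""
    encoded = contents.encode("utf-8")
    return {
        "title": title,        # the member filename
        "document": document,  # the base of the source archive
        "contents": contents,  # the extracted content of the HTML
        # Assumption: SHA-256 hex digest; the source only says "a hash of contents".
        "document_id": hashlib.sha256(encoded).hexdigest(),
        # Assumption: size measured as the UTF-8 byte length of contents.
        "size": len(encoded),
        # ISO 8601 timestamp taken when the record is built.
        "date_acquired": datetime.now().isoformat(),
    }
```

A list of such dictionaries could then be written to Parquet with a library such as pyarrow, which the sample notebook already uses for reading.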
|
||
|
||
## Parameters | ||
|
||
### User-Configurable Parameters | ||
|
||
The table below provides the parameters that users can adjust to control the behavior of the extraction: | ||
|
||
| Parameter | Default | Description | | ||
|
@@ -28,6 +55,8 @@ The table below provides the parameters that users can adjust to control the beh | |
| `favor_precision` | `True` | Prefers less content but more accurate extraction. Options: `True`, `False`. | | ||
| `favor_recall` | `True` | Extracts more content when uncertain. Options: `True`, `False`. | | ||
|
||
### Default Parameters | ||
|
||
The table below provides the parameters that are enabled by default to ensure a comprehensive extraction process: | ||
|
||
| Parameter | Default | Description | | ||
|
@@ -43,6 +72,7 @@ The table below provides the parameters that are enabled by default to ensure a | |
- To prioritize extracting more content over accuracy, set `favor_recall=True` and `favor_precision=False`. | ||
- When invoking the CLI, use the following syntax for these parameters: `--html2parquet_<parameter_name>`. For example: `--html2parquet_output_format='markdown'`. | ||
|
||
|
||
## Example | ||
|
||
### Sample HTML | ||
|
@@ -155,3 +185,27 @@ Chicago | | |
## Contact Us | ||
``` | ||
|
||
## Usage | ||
|
||
### Command-Line Interface (CLI) | ||
|
||
Run the transform with the following command: | ||
|
||
``` | ||
python ../html2parquet/python/src/html2parquet_transform_python.py \ | ||
--data_local_config "{'input_folder': '../html2parquet/python/test-data/input', 'output_folder': '../html2parquet/python/test-data/expected'}" \ | ||
--data_files_to_use '[".html", ".zip"]' | ||
``` | ||
|
||
- When invoking the CLI, use the following syntax for these parameters: `--html2parquet_<parameter_name>`. For example: `--html2parquet_output_format='markdown'`. | ||
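
The prefix rule above can be sketched as a small helper that turns a dictionary of transform options into CLI flags. The name `build_cli_args` is hypothetical, and the `--flag=value` form is an assumption based on the `--html2parquet_output_format='markdown'` example; it is not taken from the transform's source.

```python
def build_cli_args(options: dict, prefix: str = "html2parquet_") -> list:
    """Turn {'output_format': 'markdown'} into ['--html2parquet_output_format=markdown'].

    Hypothetical helper for illustration only.
    """
    args = []
    for name, value in options.items():
        # Each transform-specific parameter gets the transform prefix.
        args.append(f"--{prefix}{name}={value}")
    return args


flags = build_cli_args({"output_format": "markdown", "favor_precision": False})
print(" ".join(flags))
```

The resulting strings can be appended to the command shown above, or passed to `subprocess.run` as separate list items.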
|
||
|
||
### Sample Notebook | ||
|
||
See the [sample notebook](../notebooks/html2parquet.ipynb) for an example. | ||
|
||
|
||
## Further Resources | ||
|
||
- [Trafilatura](https://trafilatura.readthedocs.io/en/latest/usage-python.html). |