[Feature] add an example of html2pq in the documentation #788

Open
2 tasks done
sujee opened this issue Nov 8, 2024 · 17 comments
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@sujee
Contributor

sujee commented Nov 8, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

Feature

https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/html2parquet/python/README.md

shows input HTML and output MD, but doesn't have any sample code 😄

We should provide sample code

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@sujee sujee added the enhancement New feature or request label Nov 8, 2024
@Bytes-Explorer
Collaborator

Sample code is available here https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/html2parquet/python/src/html2parquet_local.py

However, there seems to be some issue with it

@sungeunan-ibm When I try to run the test files, I get the error below. Can you please check?

(data-prep-kit-html) himapatel@Himas-MacBook-Pro-2 src % python html2parquet_local.py
Traceback (most recent call last):
 File "/Users/himapatel/Work/Projects/MCD/OpenSource/html-test/data-prep-kit/transforms/language/html2parquet/python/src/html2parquet_local.py", line 15, in <module>
  from data_processing.data_access import DataAccessLocal
ModuleNotFoundError: No module named 'data_processing'

@Bytes-Explorer
Collaborator

@Bytes-Explorer Bytes-Explorer added bug Something isn't working and removed enhancement New feature or request labels Nov 8, 2024
@touma-I
Collaborator

touma-I commented Nov 8, 2024

@Bytes-Explorer It does not look like you set up the environment properly:

cd transforms/language/html2parquet/python
make venv
source venv/bin/activate
python src/html2parquet_local.py
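
The ModuleNotFoundError above usually means the script ran in an interpreter where the toolkit is not installed. A small standard-library check, sketched below, can confirm that before running the sample. The helper name `missing_modules` is made up for illustration and is not part of data-prep-kit.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # data_processing is the package named in the traceback above.
    missing = missing_modules(["data_processing"])
    if missing:
        print(f"Not importable from {sys.prefix}: {', '.join(missing)}")
        print("Activate the transform's venv (make venv; source venv/bin/activate) and retry.")
    else:
        print("data_processing is importable; the environment looks ready.")
```

If the check reports a missing module, the interpreter in use is most likely not the one from the transform's venv.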

@Bytes-Explorer
Collaborator

Yes, right. I will try again after building the environment.

This is another reason why we should simplify things so that everything works from a pip install. I am glad we are on that journey.

@touma-I
Collaborator

touma-I commented Nov 8, 2024

@sujee you should be able to use this transform very much like you use pdf2parquet. The only caveat is that they cannot be installed together: only one of pdf2parquet or html2parquet can be installed in your environment.

from data_processing.runtime.pure_python import PythonTransformLauncher
from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration

launcher = PythonTransformLauncher(Html2ParquetPythonTransformConfiguration())
launcher.launch()

@daw3rd
Member

daw3rd commented Nov 8, 2024

We seem to be mixing use cases here.

  1. Notebook-based user of a transform
  2. transforms/html2parquet-based developer of a transform.

For 1, agreed pip install should be used. For 2, make venv should be used.

@touma-I
Collaborator

touma-I commented Nov 8, 2024

You can use pip install if you want. I was simply responding to your comment, as you seemed to be trying to run the test example from the test folder.

@touma-I touma-I added documentation Improvements or additions to documentation and removed bug Something isn't working labels Nov 8, 2024
@touma-I
Collaborator

touma-I commented Nov 8, 2024

@sujee @Bytes-Explorer changing this from Bug to Documentation.
@shahrokh Where are we capturing the documentation for something like this? It would make sense to have it in the Readme.md for the transform.

@touma-I
Collaborator

touma-I commented Nov 8, 2024

Agreed. @sungeunan-ibm could you please update your Readme.md to include an example showing how a notebook user would use your transform? Please reach out if you need help with this. Thanks

@matouma

matouma commented Nov 13, 2024

@sujee Can you attach some sample HTML to this issue? Just one or two HTML files.

@sujee
Contributor Author

sujee commented Nov 13, 2024

@touma-I this is not tied to any particular HTML input.

I just need sample Python code to transform HTML --> MD.
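
For readers who only want to see the shape of an HTML --> MD conversion without installing anything, here is a minimal standard-library sketch. It is not the html2parquet transform and does not use its code. It handles only h1-h3 headings and paragraphs, and the names `TinyHtmlToMarkdown` and `html_to_markdown` are made up for illustration.

```python
from html.parser import HTMLParser


class TinyHtmlToMarkdown(HTMLParser):
    """Illustrative converter: keeps only h1-h3 headings and paragraphs."""

    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
    BLOCKS = set(HEADINGS) | {"p"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished Markdown blocks
        self._tag = None   # block tag currently open, if any
        self._text = []    # text pieces collected for the open block

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        # Text outside a tracked block (e.g. <title>) is ignored.
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._text).split())
            if text:
                prefix = self.HEADINGS.get(tag)
                self.blocks.append(f"{prefix} {text}" if prefix else text)
            self._tag = None


def html_to_markdown(html):
    """Convert an HTML string to Markdown, one blank line between blocks."""
    parser = TinyHtmlToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


if __name__ == "__main__":
    print(html_to_markdown("<h1>Title</h1><p>Hello <b>world</b>!</p>"))
```

Inline tags such as `<b>` are dropped and only their text is kept; lists, links, and tables are out of scope for this sketch.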

@matouma

matouma commented Nov 13, 2024

@sujee I understand. I just need any html

@sujee
Contributor Author

sujee commented Nov 13, 2024

Here is a sample html

ai-alliance-index.html.txt

(I had to add a .txt extension to the HTML file so I could attach it here.)

@touma-I
Collaborator

touma-I commented Nov 13, 2024

@sungeunan-ibm When you are back, please see how I did it and let me know if we need to change anything. https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb cc: @shahrokhDaijavad

@sujee
Contributor Author

sujee commented Nov 14, 2024

Just add a link to this notebook in the html2pq README so it's easy to find.

Great work @touma-I 👏

@sungeunan-ibm
Collaborator

Great! Thank you, @touma-I

Projects
None yet
Development

No branches or pull requests

6 participants