Add docs on generating embeddings from web #592

Open
wants to merge 1 commit into
base: master

Conversation

stwiname
Contributor

@stwiname stwiname requested a review from jamesbayly January 29, 2025 04:02

```shell
subql-ai embed-mdx -i ./path/to/dir/with/markdown -o ./db --table your-table-name --model nomic-embed-text
```

### From Web

This will parse all the visible text from the specified web page(s). You can specify the scope for how many links are followed to pull in more data.
Collaborator

How do we scrape these pages? It would be good to provide some details on the library we use. I also imagine there are some limitations on dynamic websites, e.g. does this work with websites that need to execute JS?

Finally, how can I verify that this was able to scrape my website? Do we export the page content as text somewhere so I can check it?
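
To make the questions above concrete, here is a minimal Python sketch of one way "visible text" extraction with a link-depth limit could work. It is not the implementation in this PR, which the thread does not describe. The names `VisibleText`, `extract`, `crawl`, and `max_depth` are hypothetical, and the `fetch` callable is an assumed stand-in for whatever HTTP client is actually used.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class VisibleText(HTMLParser):
    """Collect text outside non-rendered elements, plus outgoing links."""

    # Elements whose contents a browser does not render as page text.
    HIDDEN = {"script", "style", "noscript", "template", "head"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hidden_depth = 0  # > 0 while inside a hidden element
        self.chunks = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self.hidden_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if self.hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html, base_url):
    """Return (visible_text, absolute_links) for one HTML document."""
    parser = VisibleText(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), parser.links


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first crawl; `fetch(url)` is assumed to return HTML as a string."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        text, links = extract(fetch(url), url)
        pages[url] = text
        if depth < max_depth:  # follow links only within the allowed scope
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```

A parser like this only sees the HTML the server returns, so text that a page inserts with JavaScript after load would be missing unless a headless browser renders the page first. Printing or saving the returned `pages` mapping (URL to extracted text) is one way a user could check what was actually scraped before it is embedded.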

2 participants