Add Data modelling section #270
# Vector data

CrateDB natively supports **vector embeddings** for efficient **similarity
search** using **k-nearest neighbour (kNN)** algorithms. This makes it a
CrateDB supports two ways of doing similarity search, as far as I know. One is `knn_match`, which indeed uses a k-nearest neighbour search algorithm; the other is `vector_similarity`, which computes the Euclidean distance. I would perhaps mention here that CrateDB supports similarity search with the k-nearest neighbour (kNN) algorithm and with Euclidean distance.
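To make the terms concrete, here is a minimal Python sketch of exact (brute-force) k-nearest neighbour search ranked by Euclidean distance. It only illustrates the concept and is not CrateDB's implementation; the function names `euclidean_distance` and `knn_exact` and the sample vectors are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_exact(query, vectors, k):
    """Return the indices of the k vectors closest to `query`, nearest first.

    Brute force: the query is compared against every stored vector,
    so the result is always the true set of k nearest neighbours.
    """
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: euclidean_distance(query, vectors[i]),
    )
    return ranked[:k]


# Hypothetical sample data, for illustration only.
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
nearest = knn_exact([0.0, 0.1], vectors, k=2)
print(nearest)
```

Because every vector is scanned, the cost grows linearly with the number of stored vectors, which is why large vector stores usually rely on an index instead.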
@juanpardo Can you please make suggested edits instead? Thanks!
Actually, I think that `knn_search` uses ANN instead of KNN.
@juanpardo Will you take the lead on resolving this, with an update if one is needed?
I think I was wrong; I'm unsure. CrateDB ultimately uses `KnnFloatVectorQuery`, which has the `approximateSearch` method. The docstrings all say 'k nearest document' (kNN), but `searchNearestVectors` contains the note "The search is allowed to be approximate, meaning the results are not guaranteed to be the true k closest neighbors", which literally sounds like aNN.
So there are a few `approximate` references. My confusion is that I don't know whether the current implementation allows swapping between aNN and kNN as needed (it would make sense to use aNN in a large vector store for performance reasons), or whether we always use kNN with some kind of optimisation that makes it more 'approximate' (while still not being aNN, which is a different algorithm).
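The practical difference between the two can be measured as recall against an exact result. The following is a minimal Python sketch under a loud assumption: the "approximate" search here simply scans a random sample of the vectors, a toy stand-in that is not how Lucene's graph-based search selects candidates. All names (`squared_distance`, `top_k`, `recall`) and the data are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance; it ranks vectors the same way as the distance itself."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, candidates, k):
    """Return the k candidate indices closest to `query`, nearest first."""
    return sorted(candidates, key=lambda i: squared_distance(query, vectors[i]))[:k]


def recall(approx, exact):
    """Fraction of the true nearest neighbours that the approximate result found."""
    return len(set(approx) & set(exact)) / len(exact)


rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(1000)]
query = [rng.random() for _ in range(8)]
k = 10

# Exact kNN: every stored vector is a candidate.
exact = top_k(query, vectors, range(len(vectors)), k)

# Toy approximate search (assumption, NOT a real index): only a random
# 30% sample of the vectors is examined, so true neighbours can be missed.
sample = rng.sample(range(len(vectors)), 300)
approx = top_k(query, vectors, sample, k)

print(f"recall@{k}: {recall(approx, exact):.2f}")
```

A recall below 1.0 is exactly the situation the quoted note describes: the returned neighbours are close, but not guaranteed to be the true k closest.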
I've asked internally. I can take the lead on resolving any potential change in this section, if that's alright with y'all @bmunkholm @juanpardo
Ok. We can correct that later when we figure it out.
Added a simple table definition to allow the following join query to successfully run
Changed bits and reviewed to the best of my ability. There are things I'd like to improve, such as having fewer marketing statements, but it's not my call.
docs/start/modelling/primary-key.md
Outdated
This option involves declaring a column using `DEFAULT gen_random_text_uuid()`.
```psql
CREATE TABLE example2 (
CREATE TABLE example (
```
@surister It's intentional that I enumerate the table names where we redefine or evolve table definitions. This makes it more practical to copy/paste and test examples, and makes it clear to readers which tables we refer to afterwards.
It's certainly within scope to adjust the AI marketing prose to a useful level. This section is somewhat introductory, so it's OK for it to explain and sell the features a bit, though not in overly "marketing" terms; it should be educational and explain what we are good at.
I created #275 to fix the links that return 404 and break the build. nyc.gov also returns 404, but that seems to be temporary.
Summary of the changes / Why this is an improvement
Based on #233.
The current focus is adding the Data Modeling section and its pages, and adjusting the outline to be closer to GitBook.
All pages are still under "Getting Started".
Checklist
Preview
https://cratedb-guide--270.org.readthedocs.build/