bmunkholm
Contributor

@bmunkholm bmunkholm commented Aug 28, 2025

Summary of the changes / Why this is an improvement

Based on #233.
The current focus is adding the Data Modeling section and its pages, and adjusting the outline to be closer to Gitbook.
All pages are still under "Getting Started".

Checklist

  • Update FTS
  • Update Geospatial
  • Update json
  • Update relational
  • Update timeseries
  • Update vector

Preview

https://cratedb-guide--270.org.readthedocs.build/

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

@bmunkholm bmunkholm marked this pull request as ready for review September 2, 2025 22:51
# Vector data

CrateDB natively supports **vector embeddings** for efficient **similarity
search** using **k-nearest neighbour (kNN)** algorithms. This makes it a

CrateDB supports two ways of doing similarity search, afaik: one is `knn_match`, which indeed uses a k-nearest neighbour search algorithm, and the other is `vector_similarity`, which computes the Euclidean distance. I would perhaps mention here that CrateDB supports similarity search with both the k-nearest neighbour (kNN) algorithm and Euclidean distance.

Contributor Author

@juanpardo Can you please make suggested edits instead? Thanks!

Member

Actually, I think that `knn_match` uses aNN instead of kNN.

Contributor Author

@juanpardo Will you take the lead on resolving this, with an update if needed?

Member

I think I was wrong; I'm unsure. CrateDB ultimately uses `KnnFloatVectorQuery`, which has the `approximateSearch` method. The docstrings all say 'k nearest document' (kNN), but `searchNearestVectors` contains the note "The search is allowed to be approximate, meaning the results are not guaranteed to be the true k closest neighbors", which literally sounds like aNN.

So there are a few references to approximation. My confusion is that I don't know whether the current implementation allows swapping between aNN and kNN as needed (in a large vector store it would make sense to use aNN for performance reasons), or whether we always use kNN with some kind of optimization that makes it more 'approximate' (while still not being aNN, which is a different algorithm).
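
For reference, here is a minimal sketch of what "the true k closest neighbors" means in the quoted note. An exact (brute-force) kNN search computes the Euclidean distance from the query to every stored vector and keeps the k smallest, while an approximate search gives up that guarantee in exchange for not visiting every vector. This is plain Python for illustration only, not CrateDB's or Lucene's implementation; the names `euclidean` and `exact_knn` and the sample vectors are assumptions made up for this sketch.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query, vectors, k):
    """Return the k closest vectors as (distance, index) pairs.

    Brute force: every stored vector is compared with the query, so the
    result is guaranteed to be the true k nearest neighbours.
    """
    scored = ((euclidean(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Hypothetical sample data, not taken from the PR.
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.5, 0.0]]
print(exact_knn([0.0, 0.0], vectors, k=2))  # [(0.0, 0), (0.5, 3)]
```

The cost of this exact search grows linearly with the number of stored vectors, which is the usual motivation for approximate indexes in large vector stores.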

Member

I've asked internally. I can take the lead on resolving any potential change in this section, if that's alright with y'all @bmunkholm @juanpardo.

Contributor Author

Ok. We can correct that later when we figure it out.

Added a simple table definition to allow the following join query to successfully run
@bmunkholm

This comment was marked as resolved.

@surister

This comment was marked as resolved.

Member

@surister surister left a comment

Changed bits and reviewed to the best of my ability. There are things that I'd wish to improve but it's not my call (have less marketing statements).

This option involves declaring a column using `DEFAULT gen_random_text_uuid()`.
```psql
CREATE TABLE example2 (
CREATE TABLE example (
Contributor Author

@surister It's intentional that I enumerate the table names where we redefine or evolve table definitions. This makes it more practical to copy/paste and test examples, and makes it clear to readers which tables we refer to afterwards.

@bmunkholm
Contributor Author

bmunkholm commented Sep 11, 2025

@surister

> there are things that I'd wish to improve but it's not my call (have less marketing statements).

It's certainly within scope to adjust the AI marketing prose to a useful level. This section is somewhat introductory, so it's ok that it explains and sells the features a bit, as long as it does so in educational rather than "marketing" terms, explaining what we are good at.
I suggest we take another round at that after doing the search pages. Perhaps there is more to adjust based on those.

@bmunkholm
Contributor Author

I created #275 to fix the 404 links that break the build. nyc.gov also returns a 404, but that seems temporary.

@bmunkholm bmunkholm merged commit abde7e6 into main Sep 11, 2025
2 of 3 checks passed
@bmunkholm bmunkholm deleted the data-modelling2 branch September 11, 2025 14:14
8 participants