Update vec0.md (#177)
fixed a lot of typos and cleaned up the language (thanks for a great extension)
punkish authored Jan 18, 2025
1 parent bdc336d commit f93bc5b
Showing 1 changed file with 15 additions and 15 deletions.
30 changes: 15 additions & 15 deletions site/features/vec0.md
@@ -3,7 +3,7 @@
## Metadata in `vec0` Virtual Tables

There are three ways to store non-vector columns in `vec0` virtual tables:
-metadata columns, partition keys, and auxiliary columns. Each options has their
+metadata columns, partition keys, and auxiliary columns. Each option has its
own benefits and limitations.

```sql
@@ -48,7 +48,7 @@ create virtual table vec_movies using vec0(
```

In the `vec0` constructor, the `genre`, `num_reviews`, `mean_rating`, and
-`contains_violence` columns are metadata columns, with their specified type.
+`contains_violence` columns are metadata columns, with their specified types.

A sample KNN query on this table could look like:

@@ -64,10 +64,10 @@ where synopsis_embedding match '[...]'
```

The first two conditions in the `WHERE` clause (`synopsis_embedding match` and
-`k = 5`) denote that the query in a KNN query. The other conditions are metadata
-constraints, that `sqlite-vec` will recognize and apply during the KNN
+`k = 5`) denote that the query is a KNN query. The other conditions are metadata
+constraints that `sqlite-vec` will recognize and apply during the KNN
calculation. In other words, for the above query, a maximum of 5 rows would be
-returned, all of which would fit under all the `WHERE` constraints for their
+returned, all of which would match all the `WHERE` constraints for their
metadata column values.
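
To make "applied during the KNN calculation" concrete, here is a minimal brute-force sketch in plain Python. It is not how `sqlite-vec` is implemented: the toy rows, the squared L2 distance, and the helper names `knn_with_filter` and `knn_then_filter` are all assumptions made for illustration. It contrasts filtering before the top-`k` cut (up to `k` matching rows come back) with filtering after it (fewer rows may survive).

```python
import heapq

# Hypothetical toy data: (rowid, embedding, metadata). Not taken from sqlite-vec.
ROWS = [
    (1, [0.0, 0.0], {"genre": "scifi"}),
    (2, [0.1, 0.0], {"genre": "drama"}),
    (3, [0.2, 0.0], {"genre": "drama"}),
    (4, [0.3, 0.0], {"genre": "scifi"}),
    (5, [0.9, 0.9], {"genre": "scifi"}),
]
META = {rowid: meta for rowid, _, meta in ROWS}


def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_with_filter(query, k, predicate):
    """Apply the metadata predicate first, then keep the k nearest survivors."""
    candidates = (
        (sq_l2(vec, query), rowid) for rowid, vec, meta in ROWS if predicate(meta)
    )
    return [rowid for _, rowid in heapq.nsmallest(k, candidates)]


def knn_then_filter(query, k, predicate):
    """Take the k nearest rows overall, then drop those failing the predicate."""
    nearest = heapq.nsmallest(k, ((sq_l2(vec, query), rowid) for rowid, vec, _ in ROWS))
    return [rowid for _, rowid in nearest if predicate(META[rowid])]


def is_scifi(meta):
    return meta["genre"] == "scifi"


during = knn_with_filter([0.0, 0.0], 2, is_scifi)
after = knn_then_filter([0.0, 0.0], 2, is_scifi)
```

With this toy data, `during` holds two sci-fi rows, while `after` holds only one, because a drama row took one of the two nearest slots before the filter ran.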

#### Metadata Column Declaration
@@ -110,11 +110,12 @@ Boolean columns only support `=` and `!=` operators.
### Partition Key Columns {#partition-keys}

Partition key columns allow one to internally shard a vector index based on a
-given key. Any `=` constraint in a `WHERE` clause on a partition key column will
+given key. Any `=` constraint in a `WHERE` clause on a partition key column will
restrict the search to that clause.

For example, say you're performing vector search on a large dataset of
documents. However, each document belongs to a user, and users can only search
-their own documents. It would be wasteful to perform a brute-force over all
+their own documents. It would be wasteful to perform a brute-force search over all
documents if you only care about 1 user at a time. So, you can partition the
vector index based on user ID like so:

@@ -126,7 +127,7 @@ create virtual table vec_documents using vec0(
)
```

-Then during a KNN query, you can constrain results to a specific user in the
+Then, during a KNN query, you can constrain results to a specific user in the
`WHERE` clause like so:

```sql
@@ -172,22 +173,22 @@ where headline_embedding match :query

But be careful! Over-using partition key columns can lead to over-sharding and
slower KNN queries. As a rule of thumb, make sure that every unique partition
-key value has ~100's of vectors associated with it. In the above examples, make
+key value has ~100s of vectors associated with it. In the above examples, make
sure that every user has on the order of dozens or hundreds of documents
-each, or that every article has dozens or hundreds of articles per day. If they
+each, or that there are dozens or, preferably, hundreds of articles per day. If they
don't and you're noticing slow queries, try a broader partition key value,
like `organization_id` or `published_month`.

A maximum of 4 partition key columns can be declared in a `vec0` virtual table,
-but use caution if you find yourself using more than 1. Vectors are sharded
+but use caution if you find yourself using more than 1 partition key column. Vectors are sharded
along each unique combination, so over-sharding is more common with more
partition key columns.
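
One way to check the rule of thumb before committing to a partition key is to count rows per candidate key in your source data. The sketch below uses Python's built-in `sqlite3` module against a plain, hypothetical `documents` table (not a `vec0` table, so no extension is needed). The table, its columns, the `small_partitions` helper, and the threshold of 100 are all assumptions for illustration.

```python
import sqlite3


def small_partitions(db, table, key_column, min_rows=100):
    """Return (key, row_count) pairs for key values with fewer than min_rows rows."""
    # Identifiers are interpolated into the SQL text, so only pass trusted names.
    sql = (
        f'SELECT "{key_column}", count(*) FROM "{table}" '
        f'GROUP BY "{key_column}" HAVING count(*) < ? '
        f'ORDER BY count(*), "{key_column}"'
    )
    return db.execute(sql, (min_rows,)).fetchall()


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, organization_id INTEGER)"
)
# Hypothetical data: 6 users with 50 documents each, split across 2 organizations.
rows = [(user, user % 2) for user in range(6) for _ in range(50)]
db.executemany("INSERT INTO documents (user_id, organization_id) VALUES (?, ?)", rows)

by_user = small_partitions(db, "documents", "user_id")
by_org = small_partitions(db, "documents", "organization_id")
```

In this made-up dataset every `user_id` falls below the threshold while no `organization_id` does, which is the situation where the broader key is the better candidate.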

### Auxiliary Columns {#aux}

Auxiliary columns store additional unindexed data separate from the internal
vector index. They are meant for larger metadata that will never appear in a
-`WHERE` clause of a KNN query, eliminating the need for a separate `JOIN`.
+`WHERE` clause of a KNN query, but can be retrieved in the result set without needing a separate `JOIN`.

Auxiliary columns are denoted by a `+` prefix in their column definition, like
so:
@@ -233,8 +234,7 @@ column. It can appear in the `SELECT` clause of the KNN query, to get the most
relevant raw images.

In general, auxiliary columns are good for large text, blobs, URLs, or other
-datatypes that won't be a part of a `WHERE` clause of a KNN query. If you column
-will often appear in a `SELECT` clause but not the `WHERE` clause, then
-auxiliary columns are a good fit.
+datatypes that won't be a part of a `WHERE` clause of a KNN query. Auxiliary columns are a good fit for columns
+that will appear often in a `SELECT` clause but not in the `WHERE` clause.

A maximum of 16 auxiliary columns can be declared in a `vec0` virtual table.
