
Add SearchIndex and VectorSearchIndex #264

Open · @WaVEV WaVEV wants to merge 48 commits into main from create-atlas-indexes
Conversation

@WaVEV WaVEV commented Mar 3, 2025

Search Indexes and Vector search indexes

This PR introduces new index classes to encapsulate the definitions and details of Atlas indexes.

Notes:

I had to create my own GitHub Action in order to deploy a local MongoDB Atlas image. Maybe we should include it in the Mongo forks.

@WaVEV WaVEV requested review from timgraham and Jibola March 3, 2025 02:45
@@ -127,6 +127,38 @@ def _isnull_operator(a, b):
"iregex": lambda a, b: regex_match(a, b, insensitive=True),
}

mongo_data_types = {
Collaborator:

I guess this needs to be different from DatabaseWrapper.data_types, but we need a comment explaining the distinction.

Collaborator:

Would it be better organized in the index class, or do you see some future use of it in other areas? I don't want to make things too tricky, but perhaps a mapping of data_types values to these search data types would simplify this a bit (e.g. string -> string, int -> number, with some exceptions).

Collaborator Author:

I think this is only used for Atlas; the types are the ones described in base. So I agree with moving this mapping into the index class.

Collaborator Author (@WaVEV, Mar 12, 2025):

As for the other question, it would work with something like a double mapping with exceptions 🤔, but it should be a function/method instead of dictionaries. Let me know if you agree with using a method for those mappings.

Collaborator Author:

If we do a double mapping, the connection would have to be passed to the function. So a double mapping isn't bad, but it costs an extra parameter. I would like to keep this as it is: the dictionary may be long, but the flow is simpler (given that we don't need to pass the connection).
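
For reference, a minimal sketch of the double-mapping alternative discussed here, assuming connection.data_types maps Django internal types to MongoDB type names. The type pairs, the helper name, and the "string" fallback are illustrative only, not the PR's mongo_data_types:

MONGO_TO_SEARCH_TYPE = {
    "string": "string",
    "int": "number",
    "long": "number",
    "double": "number",
    "bool": "boolean",
    "date": "date",
    "objectId": "objectId",
}


def search_type(field, connection):
    # The extra `connection` parameter is the cost mentioned above.
    mongo_type = connection.data_types[field.get_internal_type()]
    # Fall back to "string" for anything unmapped (an arbitrary choice here).
    return MONGO_TO_SEARCH_TYPE.get(mongo_type, "string")

Keeping a single dictionary avoids threading connection through the index code, which is the trade-off described above.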

class AtlasSearchIndexTests(TestCase):
    # Schema editor is used to create the index to test that it works.
    # available_apps = ["indexes"]
    available_apps = None  # could be removed?
Collaborator:

It's not currently that helpful, but I'd include it for future possibilities. Without it, if we try to run Django's test suite without arguments, Django has to flush all test apps between each test (see 86d5885). If we added available_apps to all of Django's TestCase classes, then it would be feasible to run the Django test suite (without any test app names).

@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 03629ae to de3d245 on March 8, 2025 03:50
@WaVEV WaVEV force-pushed the create-atlas-indexes branch 2 times, most recently from 1bf4717 to 7dc04ab on March 20, 2025 21:49
@WaVEV WaVEV marked this pull request as ready for review March 20, 2025 23:05
Comment on lines 226 to 237
elif isinstance(
    field_,
    BooleanField
    | IntegerField
    | DateField
    | DateTimeField
    | CharField
    | TextField
    | UUIDField
    | ObjectIdField
    | ObjectIdAutoField,
):
Collaborator:

Is there any documentation you can link to that explains this list?

Collaborator Author:

Yes, it's here, in the fields.type table entry. Copying the exact text:

filter - for fields that contain boolean, date, objectId, numeric, string, or UUID values.

Maybe we could make use of the type mapping here instead of enumerating all the supported fields?

Collaborator:

It seems type checking may be more extensible, e.g. for custom fields.

Collaborator Author (@WaVEV, Mar 22, 2025):

🤔 OK, so checking by the internal type is better and more flexible. Maybe with the double-mapping idea or something similar.
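
A sketch of what the internal-type variant could look like, mirroring the isinstance() list above. Note that by default Field.get_internal_type() returns the subclass's own class name, so a custom field would have to override it (or be added to the set) to pass this check, which is presumably why isinstance() was also on the table:

FILTER_INTERNAL_TYPES = {
    "BooleanField",
    "IntegerField",
    "DateField",
    "DateTimeField",
    "CharField",
    "TextField",
    "UUIDField",
    "ObjectIdField",
    "ObjectIdAutoField",
}


def supports_filter(field):
    """True if a model field can back an Atlas Search "filter" field."""
    return field.get_internal_type() in FILTER_INTERNAL_TYPES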

@@ -27,14 +28,23 @@ class ArrayField(CheckFieldDefaultMixin, Field):
    }
    _default_hint = ("list", "[]")

-    def __init__(self, base_field, size=None, **kwargs):
+    def __init__(self, base_field, size=None, fixed_size=None, **kwargs):
Collaborator:

The parameter names don't look clear. I'm inclined to rename size to max_size and fixed_size to size. What do you think? Let's make this change in a separate PR that precedes this one. I'm thinking first commit: rename size to max_size; second commit: add size.

Collaborator:

I did part 1 in #273. Want to send a PR to my branch with part 2? There is a bit more to do than you already did (like handle the parameter in deconstruct(), add it to test migrations, support in forms, and update documentation). Hopefully my PR gives hints about the places that need updating.
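
For the deconstruct() part, the usual Field pattern is sketched below on a stand-alone illustrative field (not the real ArrayField), assuming the new keyword ends up named size:

from django.db import models


class FixedSizeArrayField(models.Field):
    """Illustrative only: how a new `size` kwarg gets round-tripped in migrations."""

    def __init__(self, base_field, size=None, **kwargs):
        self.base_field = base_field
        self.size = size
        super().__init__(**kwargs)

    def deconstruct(self):
        name, path, args, kwargs = super().deconstruct()
        kwargs["base_field"] = self.base_field
        if self.size is not None:
            kwargs["size"] = self.size
        return name, path, args, kwargs

makemigrations only serializes what deconstruct() returns, so forgetting this is exactly what the "test migrations" item above would catch.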

Collaborator Author:

OK, shall I add those changes to my PR as well?

Comment on lines +189 to +239
try:
    vector_size = int(field_.size)
except (ValueError, TypeError) as err:
    raise ValueError("Atlas vector search requires size.") from err
if not isinstance(field_.base_field, FloatField | DecimalField):
    raise ValueError("Base type must be Float or Decimal.")
Collaborator:

This sort of validation really belongs as a system check, since at this point the user has already generated an invalid migration and has to do quite a lot to fix the problem. Unfortunately, the Index.check() hook is missing. Until then, however, I think we can write a system check that iterates over all the models in app_configs and does this validation. Let me know if you run into trouble with that idea.
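
A minimal sketch of such a check, assuming the index and field classes live in django_mongodb_backend.indexes and django_mongodb_backend.fields; the check id and the exact validation are placeholders:

from django.apps import apps
from django.core import checks

from django_mongodb_backend.fields import ArrayField
from django_mongodb_backend.indexes import VectorSearchIndex


@checks.register(checks.Tags.models)
def check_vector_search_indexes(app_configs=None, **kwargs):
    errors = []
    for app_config in app_configs or apps.get_app_configs():
        for model in app_config.get_models():
            for index in model._meta.indexes:
                if not isinstance(index, VectorSearchIndex):
                    continue
                for field_name in index.fields:
                    field = model._meta.get_field(field_name)
                    if isinstance(field, ArrayField) and field.size is None:
                        errors.append(
                            checks.Error(
                                "Atlas vector search requires size.",
                                obj=model,
                                id="django_mongodb_backend.E001",  # placeholder id
                            )
                        )
    return errors

Registration could live in the backend's AppConfig.ready() so the check is picked up automatically.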

Collaborator Author:

I am creating the system check. The problem here is: can I import connection? Can I get that?

Collaborator Author:

A first approach to this is implemented.

Comment on lines +169 to +195
def test_field_size_required(self):
    index = VectorSearchIndex(
        name="recent_article_idx",
        fields=["number_list"],
    )
    msg = "Atlas vector search requires size."
    with self.assertRaisesMessage(ValueError, msg), connection.schema_editor() as editor:
        editor.add_index(index=index, model=Article)
Collaborator:

For testing invalid model logic (which can hopefully be done as system checks), see
https://github.com/django/django/blob/922c1c732a47c02aa5ef28b0b1a2bd9bc9b92d87/tests/check_framework/test_model_checks.py
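
A sketch of what such a test could look like, following the isolate_apps pattern from the linked file. The app label, the import of the check (the placeholder check sketched a little earlier in this conversation), and the assertion message are all assumptions:

from django.db import models
from django.test import SimpleTestCase
from django.test.utils import isolate_apps

from django_mongodb_backend.fields import ArrayField
from django_mongodb_backend.indexes import VectorSearchIndex
from myproject.checks import check_vector_search_indexes  # hypothetical location of the check sketched above


@isolate_apps("indexes", attr_name="apps")
class VectorSearchIndexCheckTests(SimpleTestCase):
    def test_size_required(self):
        class Article(models.Model):
            number_list = ArrayField(models.FloatField())  # no size given

            class Meta:
                indexes = [VectorSearchIndex(fields=["number_list"], name="vector_idx")]

        errors = check_vector_search_indexes(app_configs=self.apps.get_app_configs())
        self.assertEqual(errors[0].msg, "Atlas vector search requires size.")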

Comment on lines 161 to 204
if isinstance(similarities, str):
self._check_similarity_functions([similarities])
else:
self._check_similarity_functions(similarities)
Collaborator:

I'm not sure I like accepting both a string and a list. Did you find a precedent for that? Even if so, self.similarities could at least be stored as a list to simplify future usage. Since you have a custom parameter, you need to add a deconstruct() method.
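
Storing the value as a list plus the deconstruct() hook could look roughly like this (an illustrative Index subclass, not the PR's VectorSearchIndex):

from django.db import models


class SimilarityIndex(models.Index):
    """Illustrative only; shows the custom-kwarg pattern for Index subclasses."""

    def __init__(self, *expressions, similarities="cosine", **kwargs):
        # Normalize to a list up front so later code never branches on str vs list.
        self.similarities = (
            [similarities] if isinstance(similarities, str) else list(similarities)
        )
        super().__init__(*expressions, **kwargs)

    def deconstruct(self):
        path, args, kwargs = super().deconstruct()
        kwargs["similarities"] = self.similarities
        return path, args, kwargs

Without the deconstruct() override, makemigrations would silently drop the similarities argument.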

Collaborator Author:

I tried the pattern where a single value applies to all elements and a list assigns values one-to-one, like the border parameter of expand in Pillow. But I can make it one-to-one.

Collaborator:

I didn't realize that, so I'll look again.

@timgraham timgraham changed the title from "Create atlas indexes" to "Add SearchIndex and VectorSearchIndex" on Mar 23, 2025
Comment on lines 82 to 89
def test_field_not_exists(self):
    index = SearchIndex(
        name="recent_article_idx",
        fields=["headline", "non_existing_name"],
    )
    msg = "Article has no field named 'non_existing_name'"
    with self.assertRaisesMessage(FieldDoesNotExist, msg), connection.schema_editor() as editor:
        editor.add_index(index=index, model=Article)
Collaborator:

This test may be unnecessary since it's already validated by a system check:

search_indexes.Article: (models.E012) 'indexes' refers to the nonexistent field 'headline'.

@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 60d49de to 2865e13 on March 25, 2025 03:05
Comment on lines 6 to 10
from django.db.models import (
    DecimalField,
    FloatField,
    Index,
)
Collaborator:

Items were removed, so this can be a single line.

@@ -101,6 +109,156 @@ def where_node_idx(self, compiler, connection):
    return mql


def get_pymongo_index_model(
Collaborator:

This is added redundantly after the rebase.

@@ -17,10 +17,12 @@
"default": {
"ENGINE": "django_mongodb_backend",
"NAME": "djangotests",
"OPTIONS": {"directConnection": True},
Contributor:

Let's add a comment explaining that this parameter is necessary for local Atlas instances running in Docker.

Collaborator Author:

The Docker site doesn't provide much info about it; it just gives the connection URI format. I think it is there to connect to a single MongoDB instance, like standalone mode, but I need to confirm that.
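
Something like the following might do, assuming the documented meaning of the URI option holds here: directConnection makes the driver connect to the given host directly instead of discovering the rest of the deployment's topology. The comment wording is a sketch, not the PR's final text:

DATABASES = {
    "default": {
        "ENGINE": "django_mongodb_backend",
        "NAME": "djangotests",
        # Required when testing against a local Atlas instance running in
        # Docker: connect directly to the single node instead of trying to
        # discover a replica set topology.
        "OPTIONS": {"directConnection": True},
    },
}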

@@ -27,13 +28,21 @@ class ArrayField(CheckFieldDefaultMixin, Field):
    }
    _default_hint = ("list", "[]")

-    def __init__(self, base_field, size=None, **kwargs):
+    def __init__(self, base_field, max_size=None, size=None, **kwargs):
Contributor:

If this is just meant for vector searching, then can we have this be "fixed_size=True" rather than max_size? Then the validator would just use a ternary between ArrayMaxLengthValidator and LengthValidator. I think that may make the difference easier to intuit.
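
A rough sketch of that ternary, with the caveat that there is no exact-length validator in Django core, so LengthValidator is stubbed here and ArrayMaxLengthValidator is borrowed from django.contrib.postgres purely for illustration:

from django.contrib.postgres.validators import ArrayMaxLengthValidator
from django.core.exceptions import ValidationError


class LengthValidator:
    """Hypothetical validator requiring exactly `length` items."""

    def __init__(self, length):
        self.length = length

    def __call__(self, value):
        if len(value) != self.length:
            raise ValidationError(f"List must contain exactly {self.length} items.")


def array_validators(size, fixed_size=False):
    # The suggested ternary: exact length when fixed_size, otherwise a maximum.
    if size is None:
        return []
    return [LengthValidator(size) if fixed_size else ArrayMaxLengthValidator(size)]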

Collaborator Author:

ArrayField is not only for vector search; it is general purpose: an array of arrays, an array of char fields, or an array of embedded fields.

{
    "type": "vector",
    "numDimensions": vector_size,
    "similarity": next(similarities),
Contributor:

I'm still not sure why similarities needs to be a list?
I saw earlier that it was one value per field, but I'm still not sure why we can't enforce just one index type as the similarity?

Another nit: this mapping doesn't include optional quantization. I don't think it will be necessary until we support BSONVector, but I just wanted to note it here.

Collaborator Author:

Well, the code is a bit weird. Maybe it's not easy to read.
The idea here is that we can have a vector index with more than one vector, right? Like creating an index with two vectors—one using cosine similarity and the other using dot product. So the idea is that if you define multiple vector fields, you can assign them different similarity functions. If a single string is passed as the similarity function, it will be applied to all vectors.
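
In other words, a single similarity is broadcast over every vector field, while a list is consumed one per field. A small sketch of that behaviour (illustrative, not the PR's code; the function name and the field names and dimensions in the usage note are made up):

import itertools


def vector_definitions(paths, similarities, dimensions):
    """Build one Atlas vector field definition per path."""
    similarity_iter = (
        itertools.repeat(similarities) if isinstance(similarities, str) else iter(similarities)
    )
    return [
        {
            "type": "vector",
            "path": path,
            "numDimensions": num_dimensions,
            "similarity": next(similarity_iter),
        }
        for path, num_dimensions in zip(paths, dimensions)
    ]


# e.g. vector_definitions(["title_vec", "body_vec"], ["cosine", "dotProduct"], [384, 768])
# or   vector_definitions(["title_vec", "body_vec"], "cosine", [384, 768])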

@WaVEV WaVEV force-pushed the create-atlas-indexes branch from b992aa7 to 977da85 on March 27, 2025 04:21