-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SearchIndex and VectorSearchIndex #264
base: main
Are you sure you want to change the base?
Conversation
django_mongodb_backend/base.py
Outdated
@@ -127,6 +127,38 @@ def _isnull_operator(a, b): | |||
"iregex": lambda a, b: regex_match(a, b, insensitive=True), | |||
} | |||
|
|||
mongo_data_types = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is needs to be different than DatabaseWrapper.data_types
but we need a comment explaining the distinction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better organized in the index class or do you see some future use of it in other areas? I don't want to make things too tricky, but perhaps a mapping of data_types
values to these search data types would simply this a bit. (e.g. string -> string, int -> number, with some exceptions....)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is only used in atlas. The types that are described in base. So I agree to move this mapping in the index class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To the other question, it will work with something like double mapping with exceptions. 🤔 but it should be a function/method instead of a dictionaries. Let me know if you agree with a method for those mappings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do a double mapping, connection should be passed to the function. So, it is not bad to have a double mapping but it will cost to add a new parameter. I would like to keep this as it is, maybe the dictionary is long but the flow is more simple. (given we don't need to pass connection)
tests/indexes_/test_atlas_indexes.py
Outdated
class AtlasSearchIndexTests(TestCase): | ||
# Schema editor is used to create the index to test that it works. | ||
# available_apps = ["indexes"] | ||
available_apps = None # could be removed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not currently that helpful, but I'd include it for future possibilities. Without it, if we try to run Django's test suite without arguments, Django has to flush all test apps between each test (see 86d5885). If we added available_apps
to all of Django's TestCase
classes, then it would be feasible to run the Django test suite (without any test app names).
03629ae
to
de3d245
Compare
django_mongodb_backend/base.py
Outdated
@@ -127,6 +127,38 @@ def _isnull_operator(a, b): | |||
"iregex": lambda a, b: regex_match(a, b, insensitive=True), | |||
} | |||
|
|||
mongo_data_types = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better organized in the index class or do you see some future use of it in other areas? I don't want to make things too tricky, but perhaps a mapping of data_types
values to these search data types would simply this a bit. (e.g. string -> string, int -> number, with some exceptions....)
1bf4717
to
7dc04ab
Compare
django_mongodb_backend/indexes.py
Outdated
elif isinstance( | ||
field_, | ||
BooleanField | ||
| IntegerField | ||
| DateField | ||
| DateTimeField | ||
| CharField | ||
| TextField | ||
| UUIDField | ||
| ObjectIdField | ||
| ObjectIdAutoField, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any documentation you can link to that explains this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's here. It is in fields.type table entry. I think copying the exact text:
filter - for fields that contain boolean, date, objectId, numeric, string, or UUID values.
Maybe we could make use the type mapping here? Instead of enumerating all the supporting fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems type checking may be more extensible, e.g. for custom fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 ok, so, checking by the internal type is better and more flexible. Maybe with the double mapping idea or similar.
@@ -27,14 +28,23 @@ class ArrayField(CheckFieldDefaultMixin, Field): | |||
} | |||
_default_hint = ("list", "[]") | |||
|
|||
def __init__(self, base_field, size=None, **kwargs): | |||
def __init__(self, base_field, size=None, fixed_size=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter names don't look clear. I'm inclined to rename size
to max_size
and fixed_size
to size
. What do you think? Let's add this feature in a separate PR to precede this feature. I'm thinking first commit: rename size
to max_size
, second commit: add size
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did part 1 in #273. Want to send a PR to my branch with part 2? There is a bit more to do than you already did (like handle the parameter in deconstruct()
, add it to test migrations, support in forms, and update documentation). Hopefully my PR gives hints about the places that need updating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, shall I add those change also to my PR?
django_mongodb_backend/indexes.py
Outdated
elif isinstance( | ||
field_, | ||
BooleanField | ||
| IntegerField | ||
| DateField | ||
| DateTimeField | ||
| CharField | ||
| TextField | ||
| UUIDField | ||
| ObjectIdField | ||
| ObjectIdAutoField, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems type checking may be more extensible, e.g. for custom fields.
try: | ||
vector_size = int(field_.size) | ||
except (ValueError, TypeError) as err: | ||
raise ValueError("Atlas vector search requires size.") from err | ||
if not isinstance(field_.base_field, FloatField | DecimalField): | ||
raise ValueError("Base type must be Float or Decimal.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sort of validation really belongs as a system check, since at this point the user has already generated an invalid migration and has to do quite a lot to fix the problem. Unfortunately, the Index.check()
hook is missing. Until then, however, I think we can write a system check that iterates over all the models in app_configs
and does this validation. Let me know if you run into trouble with that idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am creating the system check, the problem here is: can I import connection. Can I get that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first approach to this is implemented.
def test_field_size_required(self): | ||
index = VectorSearchIndex( | ||
name="recent_article_idx", | ||
fields=["number_list"], | ||
) | ||
msg = "Atlas vector search requires size." | ||
with self.assertRaisesMessage(ValueError, msg), connection.schema_editor() as editor: | ||
editor.add_index(index=index, model=Article) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For testing invalid model logic (which can hopefully be done as system checks), see
https://github.com/django/django/blob/922c1c732a47c02aa5ef28b0b1a2bd9bc9b92d87/tests/check_framework/test_model_checks.py
django_mongodb_backend/indexes.py
Outdated
if isinstance(similarities, str): | ||
self._check_similarity_functions([similarities]) | ||
else: | ||
self._check_similarity_functions(similarities) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like accepting both a string and a list. Did you find a precedent for that? Even if so, self.similarities
could at least be stored as a list to simplify future usage. Since you have a custom parameter, you need to add a deconstruct()
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the pattern:
single value applies to all elements, and a list assigns values one-to-one.
Like:
border parameter of expand in pillow
But I can make it one to one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize that, so I'll look again.
tests/indexes_/test_atlas_indexes.py
Outdated
def test_field_not_exists(self): | ||
index = SearchIndex( | ||
name="recent_article_idx", | ||
fields=["headline", "non_existing_name"], | ||
) | ||
msg = "Article has no field named 'non_existing_name'" | ||
with self.assertRaisesMessage(FieldDoesNotExist, msg), connection.schema_editor() as editor: | ||
editor.add_index(index=index, model=Article) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test may be unnecessary since it's already validated by a system check:
search_indexes.Article: (models.E012) 'indexes' refers to the nonexistent field 'headline'.
60d49de
to
2865e13
Compare
django_mongodb_backend/indexes.py
Outdated
from django.db.models import ( | ||
DecimalField, | ||
FloatField, | ||
Index, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Items were removed, so this can be a single line.
django_mongodb_backend/indexes.py
Outdated
@@ -101,6 +109,156 @@ def where_node_idx(self, compiler, connection): | |||
return mql | |||
|
|||
|
|||
def get_pymongo_index_model( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is added redundantly after the rebase.
@@ -17,10 +17,12 @@ | |||
"default": { | |||
"ENGINE": "django_mongodb_backend", | |||
"NAME": "djangotests", | |||
"OPTIONS": {"directConnection": True}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a comment explaining this parameter is necessary against local atlas instances running in docker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docker site don't provide much info about it. Just give the connection uri format.
I think is to connect to a single MongoDB instance like standalone mode. But need to confirm that.
@@ -27,13 +28,21 @@ class ArrayField(CheckFieldDefaultMixin, Field): | |||
} | |||
_default_hint = ("list", "[]") | |||
|
|||
def __init__(self, base_field, size=None, **kwargs): | |||
def __init__(self, base_field, max_size=None, size=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is just meant for vector_searching then can we have this be "fixed_size=True" rather than max_size? Then the validator would just use a ternary between ArrayMaxLengthValidator
and LengthValidator
. I think that may make understanding the difference easier to intuit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array is not only for vector search, is for general purpose. Like array of array. Or array of char field or array of embedded fields
{ | ||
"type": "vector", | ||
"numDimensions": vector_size, | ||
"similarity": next(similarities), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure why similarities
needs to be a list?
I saw earlier that it was one value per field, but I'm still not sure why we can't enforce just one index type as the similarity?
Another NIT: this mapping doesn't include optional quantization. I don't think will be necessary until we support BSONVector, but just wanted to note it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the code is a bit weird. Maybe it's not easy to read.
The idea here is that we can have a vector index with more than one vector, right? Like creating an index with two vectors—one using cosine similarity and the other using dot product. So the idea is that if you define multiple vector fields, you can assign them different similarity functions. If a single string is passed as the similarity function, it will be applied to all vectors.
Jib's suggestion Co-authored-by: Jib <[email protected]>
b992aa7
to
977da85
Compare
Search Indexes and Vector search indexes
This PR introduces new index classes to encapsulate the definitions and details of Atlas indexes.
Notes.
I have to create my own action in order to deploy MongoDB Atlas local image. Maybe we have to include in Mongo forks.