Add SearchIndex and VectorSearchIndex #264
django_mongodb_backend/indexes.py
Outdated
if field.get_internal_type() == "UUIDField":
    return "uuid"
As currently implemented, I believe this should be string:
"UUIDField": "string",
Separately, maybe it's worth confirming that UUIDField can't store its values as BSON uuid. Setting DatabaseFeatures.has_native_uuid_field = True raises the error: ValueError: cannot encode native uuid.UUID with UuidRepresentation.UNSPECIFIED. UUIDs can be manually converted to bson.Binary instances using bson.Binary.from_uuid() or a different UuidRepresentation can be configured. See the documentation for UuidRepresentation for more information.
I don't think I tried any further to make that work.
Yes, good catch. I think we could support UUIDField if we adapt UUIDField.get_db_prep_value and maybe to_python (I'd have to check), the way DurationField was adapted. That seems to be out of scope here; we could add a separate task for it.
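For reference, a minimal sketch of why string would be the matching search type today. It assumes the backend follows Django's stock UUIDField.get_db_prep_value behavior when has_native_uuid_field is False, where the value is written as a 32-character hex string. uuid_to_db_value is a hypothetical stand-in, not a function from this PR.

```python
import uuid


def uuid_to_db_value(value, has_native_uuid_field=False):
    """Hypothetical stand-in mirroring Django's stock UUIDField.get_db_prep_value."""
    if value is None:
        return None
    if not isinstance(value, uuid.UUID):
        value = uuid.UUID(str(value))
    # Without native UUID support the value is written as a hex string, so a
    # search index over the field sees a string rather than a BSON uuid.
    return value if has_native_uuid_field else value.hex


stored = uuid_to_db_value(uuid.UUID("12345678-1234-5678-1234-567812345678"))
print(type(stored).__name__, stored)
```

If the field were later adapted to store BSON binary UUIDs, the search type would need to change along with it.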
django_mongodb_backend/indexes.py
Outdated
if field.get_internal_type() == "EmbeddedModelField":
    return "embeddedDocuments"
This is the type we'll also want for EmbeddedModelArrayField, correct? We might check whether EmbeddedModelField.db_type() should be defined so that it returns embeddedDocuments.
How about support for ArrayField? https://www.mongodb.com/docs/atlas/atlas-search/field-types/array-type/
And JSONField? (search type="document"?)
JSONField is supported, if I am not mistaken. Its db_type is object, so it is mapped to document. We could map ArrayField too. I don't see any downside to defining EmbeddedModelField.db_type() as embeddedDocuments.
EDIT: the only downside is the trailing "s"; all the other types are singular.
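To make the mapping under discussion concrete, here is a rough sketch. The table names, the search_type helper, and the numeric and boolean entries are assumptions for illustration; only the UUIDField, EmbeddedModelField, and JSONField (object to document) entries come from this thread.

```python
# Hypothetical lookup tables; not the PR's final code.
SEARCH_TYPE_BY_INTERNAL_TYPE = {
    "UUIDField": "string",  # stored as a string today (see the thread above)
    "EmbeddedModelField": "embeddedDocuments",
}
SEARCH_TYPE_BY_DB_TYPE = {
    "object": "document",  # JSONField
    "string": "string",
    "double": "number",  # assumption: numeric BSON types map to "number"
    "bool": "boolean",  # assumption
}


def search_type(internal_type, db_type):
    """Return the Atlas Search field type for a field, or raise ValueError."""
    # Field-specific overrides win over the generic db_type lookup.
    if internal_type in SEARCH_TYPE_BY_INTERNAL_TYPE:
        return SEARCH_TYPE_BY_INTERNAL_TYPE[internal_type]
    try:
        return SEARCH_TYPE_BY_DB_TYPE[db_type]
    except KeyError:
        raise ValueError(
            f"No search type known for {internal_type} ({db_type})."
        ) from None


print(search_type("JSONField", "object"))
```

Defining EmbeddedModelField.db_type() as embeddedDocuments would let the override table shrink, at the cost of the plural name noted above.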
Supporting ArrayField would work as I explained here. I will add it, and if it is not convincing we can remove it. I also noticed that we could have an array of integers or floats that is not necessarily an embedded document (for example, a vector to search on), and it could also have a fixed size. So the current classification won't work.
For vector search, I cannot add this support as simply as for search indexes. It might need something that separates filter fields from vector fields, which would change the class signature. It is not very difficult, but I suggest working on it in a later iteration.
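One possible shape for that separation, sketched under assumptions: vector_index_definition and its arguments are hypothetical and do not reflect the class signature in this PR. The output follows the Atlas Vector Search index definition format, where each entry in fields is either a vector field or a filter field.

```python
def vector_index_definition(vector_fields, filter_fields=()):
    """Build an Atlas Vector Search index definition.

    vector_fields: iterable of (path, num_dimensions, similarity) tuples.
    filter_fields: iterable of field paths that may be used in "filter".
    """
    fields = [
        {
            "type": "vector",
            "path": path,
            "numDimensions": num_dimensions,
            "similarity": similarity,
        }
        for path, num_dimensions, similarity in vector_fields
    ]
    # Filter fields only need a path; they carry no vector settings.
    fields += [{"type": "filter", "path": path} for path in filter_fields]
    return {"fields": fields}


definition = vector_index_definition([("values", 5, "cosine")], ["name"])
print(definition)
```

Keeping the two kinds of fields as separate arguments avoids guessing a field's role from its Django type.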
django_mongodb_backend/indexes.py
Outdated
            id=f"{self._error_id_prefix}.E002",
        )
    )
if not isinstance(field_.base_field, FloatField | DecimalField):
Is isinstance() definitely what we want here, as opposed to, say, checking db_type()? (That would be double and decimal for those fields.) On the one hand, an error message that references FloatField and DecimalField probably covers the most common use cases and is going to be easier to understand than one referencing double/decimal. Just wondering if you had any thoughts on this.
Is this the relevant documentation for this restriction? https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#about-the-vector-type I see double there, but it's not clear to me how DecimalField qualifies. I'm not sure what the documentation means when it mentions:
- BSON BinData vector subtype float32
- BSON BinData vector subtype int1
- BSON BinData vector subtype int8
DecimalField is not used for that; IntegerField is. But given that not all the similarity functions are supported for int1, shall we support only float32?
I'm not sure, maybe we ask the team.
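To illustrate where the two checks in this thread would diverge, here is a small sketch. The classes are simplified stand-ins, not Django's fields, and RawDoubleField is a hypothetical custom field that stores a double without subclassing FloatField.

```python
class Field:
    """Simplified stand-in for a Django model field."""

    def db_type(self, connection=None):
        raise NotImplementedError


class FloatField(Field):
    def db_type(self, connection=None):
        return "double"


class DecimalField(Field):
    def db_type(self, connection=None):
        return "decimal"


class RawDoubleField(Field):
    """Hypothetical custom field: stores a double, unrelated to FloatField."""

    def db_type(self, connection=None):
        return "double"


def allowed_by_class(base_field):
    # Mirrors the isinstance() check in the diff above.
    return isinstance(base_field, (FloatField, DecimalField))


def allowed_by_db_type(base_field, allowed=("double", "decimal")):
    # Alternative: decide from the stored type rather than the Python class.
    return base_field.db_type() in allowed


for field in (FloatField(), DecimalField(), RawDoubleField()):
    print(type(field).__name__, allowed_by_class(field), allowed_by_db_type(field))
```

The checks agree for the built-in fields and differ only for third-party fields, so the choice mostly affects how the error message is worded.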
django_mongodb_backend/indexes.py
Outdated
    )
else:
    field_type = field_.db_type(connection)
    search_type = self.search_index_data_types(field_, field_type)
The linked doc says, "You can filter on boolean, date, objectId, numeric, string, and UUID values, including arrays of these types."
Does this account for arrays?
Let me try to shed some light on this behavior. The array can appear on the document side (the field being filtered), but not as the filter value. If I try a query like:
db.t.t.aggregate([
  {
    "$vectorSearch": {
      "index": "vector_index",
      "filter": {
        "name": ["example", "example3"]
      },
      "path": "values",
      "queryVector": [0, 0, 0, 0, 0],
      "numCandidates": 150,
      "limit": 10,
      "quantization": "scalar"
    }
  }
])
I get the following error:
MongoServerError[UnknownError]: PlanExecutor error during aggregation :: caused by :: "filter" must be a boolean, objectId, number, string, date, uuid, or null
So I cannot pass an array as a filter value.
But I can define documents like:
db.t.t.insertOne({
  name: "example",
  values: [1.23, 4.56, 7.89, 0.12, 3.45]
})
db.t.t.insertOne({
  name: ["example", "example2"],
  values: [1.23, 4.56, 7.89, 0.12, 3.45]
})
Then I filter by:
db.t.t.aggregate([
  {
    "$vectorSearch": {
      "index": "vector_index",
      "filter": {
        "name": "example2"
      },
      "path": "values",
      "queryVector": [0, 0, 0, 0, 0],
      "numCandidates": 150,
      "limit": 10,
      "quantization": "scalar"
    }
  }
])
I get the second document, and if I filter by example I get both.
But indeed we could support arrays of those types.
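The behavior observed above can be modeled in a few lines of plain Python. This is only a sketch of the equality case from the experiment; validate_filter and matches are hypothetical helpers, and operator expressions such as $in are deliberately not modeled.

```python
def validate_filter(filter_):
    """Reject array filter values, as the server did in the first query."""
    for path, value in filter_.items():
        if isinstance(value, (list, tuple)):
            raise ValueError(f"Filter on {path!r} must be a scalar, not an array.")


def matches(document, filter_):
    """Equality filter where a stored array matches if it contains the value."""
    validate_filter(filter_)
    for path, expected in filter_.items():
        stored = document.get(path)
        if isinstance(stored, list):
            if expected not in stored:
                return False
        elif stored != expected:
            return False
    return True


docs = [
    {"name": "example", "values": [1.23, 4.56, 7.89, 0.12, 3.45]},
    {"name": ["example", "example2"], "values": [1.23, 4.56, 7.89, 0.12, 3.45]},
]
hits_example2 = [d for d in docs if matches(d, {"name": "example2"})]
hits_example = [d for d in docs if matches(d, {"name": "example"})]
print(len(hits_example2), len(hits_example))
```

So an ArrayField of a filterable type can be indexed as a filter field, while the value passed in the query stays scalar.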
name: Django Test Suite
runs-on: ubuntu-latest
steps:
  - name: Checkout django-mongodb-backend
    uses: actions/checkout@v4
    with:
      persist-credentials: false
  - name: install django-mongodb-backend
    run: |
      pip3 install --upgrade pip
      pip3 install -e .
  - name: Checkout Django
    uses: actions/checkout@v4
    with:
      repository: 'mongodb-forks/django'
      ref: 'mongodb-5.1.x'
      path: 'django_repo'
      persist-credentials: false
  - name: Install system packages for Django's Python test dependencies
    run: |
      sudo apt-get update
      sudo apt-get install libmemcached-dev
  - name: Install Django and its Python test dependencies
    run: |
      cd django_repo/tests/
      pip3 install -e ..
      pip3 install -r requirements/py3.txt
  - name: Copy the test settings file
    run: cp .github/workflows/mongodb_settings.py django_repo/tests/
  - name: Copy the test runner file
    run: cp .github/workflows/runtests.py django_repo/tests/runtests_.py
  - name: Start local Atlas
    working-directory: .
    run: bash .github/workflows/start_local_atlas.sh mongodb/mongodb-atlas-local:7
  - name: Run tests
    run: python3 django_repo/tests/runtests_.py
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium test
Co-authored-by: Tim Graham <[email protected]>