Add SearchIndex and VectorSearchIndex #264

Merged
merged 2 commits into from
May 5, 2025

Conversation

WaVEV
Collaborator

@WaVEV WaVEV commented Mar 3, 2025

No description provided.

@WaVEV WaVEV requested review from timgraham and Jibola March 3, 2025 02:45
@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 03629ae to de3d245 Compare March 8, 2025 03:50
@WaVEV WaVEV force-pushed the create-atlas-indexes branch 2 times, most recently from 1bf4717 to 7dc04ab Compare March 20, 2025 21:49
@WaVEV WaVEV marked this pull request as ready for review March 20, 2025 23:05
@timgraham timgraham changed the title Create atlas indexes Add SearchIndex and VectorSearchIndex Mar 23, 2025
@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 60d49de to 2865e13 Compare March 25, 2025 03:05
@WaVEV WaVEV force-pushed the create-atlas-indexes branch 2 times, most recently from 9fdc143 to 15e3450 Compare March 31, 2025 03:05
@WaVEV WaVEV force-pushed the create-atlas-indexes branch 3 times, most recently from b06db74 to e69da64 Compare April 9, 2025 06:00
timgraham

This comment was marked as resolved.

@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 92caf14 to 61b1c05 Compare April 12, 2025 16:19
@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 00b0323 to 08654ec Compare April 15, 2025 02:17
@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 3237ac8 to f55a410 Compare April 18, 2025 22:57
@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 3d9e815 to f4a5b9a Compare April 21, 2025 16:53
Comment on lines 133 to 134
if field.get_internal_type() == "UUIDField":
    return "uuid"
Collaborator

As currently implemented, I believe this should be string:

Separately, maybe it's worth confirming that UUIDField can't store its values as BSON uuid. Setting DatabaseFeatures.has_native_uuid_field = True raises the error: ValueError: cannot encode native uuid.UUID with UuidRepresentation.UNSPECIFIED. UUIDs can be manually converted to bson.Binary instances using bson.Binary.from_uuid() or a different UuidRepresentation can be configured. See the documentation for UuidRepresentation for more information.

I don't think I tried any further to make that work.

Collaborator Author

@WaVEV WaVEV Apr 22, 2025

Yes, good catch. I think we could support UUIDField if we adapt UUIDField.get_db_prep_value and maybe to_python (I'd have to check), the way DurationField was adapted. That seems out of scope here; we could add a separate task for it.
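
For reference, a minimal sketch of the string round trip this thread is about. It assumes the backend follows Django's default non-native path, where a UUID is stored as its 32-character hex string. The two functions are simplified, hypothetical stand-ins for UUIDField.get_db_prep_value() and UUIDField.to_python(), not the real methods.

```python
import uuid


def get_db_prep_value(value, has_native_uuid_field=False):
    """Simplified, hypothetical stand-in for UUIDField.get_db_prep_value()."""
    if value is None:
        return None
    if not isinstance(value, uuid.UUID):
        value = uuid.UUID(str(value))
    if has_native_uuid_field:
        # The uuid.UUID object is handed to the driver unchanged. This is the
        # path that raised the UuidRepresentation.UNSPECIFIED error above.
        return value
    # Non-native path: store the UUID as a 32-character hex string.
    return value.hex


def to_python(value):
    """Simplified, hypothetical stand-in for UUIDField.to_python()."""
    if value is None or isinstance(value, uuid.UUID):
        return value
    return uuid.UUID(value)


original = uuid.UUID("12345678-1234-5678-1234-567812345678")
stored = get_db_prep_value(original)
print(stored)             # 12345678123456781234567812345678
print(to_python(stored))  # 12345678-1234-5678-1234-567812345678
```

Because the stored value is a plain string in this path, a search index mapping of "string" matches what is actually in the collection.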

Comment on lines 137 to 138
if field.get_internal_type() == "EmbeddedModelField":
    return "embeddedDocuments"
Collaborator

This is the type we also will want for EmbeddedModelArrayField, correct? We might check if EmbeddedModelField.db_type() should be defined and could return embeddedDocuments.

How about support for ArrayField? https://www.mongodb.com/docs/atlas/atlas-search/field-types/array-type/

And JSONField? (search type="document"?)

Collaborator Author

@WaVEV WaVEV Apr 23, 2025

JSONField is supported, if I'm not mistaken: its db_type is object, so it is mapped to document. We could map ArrayField as well. I don't see any downside to defining EmbeddedModelField.db_type() as embeddedDocuments.

EDIT: the only downside is the trailing "s"; all the other types are singular.

Collaborator Author

@WaVEV WaVEV Apr 23, 2025

Supporting ArrayField would work as I explained here. I will add it, and if it isn't convincing, we can remove it. I also noticed that we could have an array of integers or floats that isn't necessarily an embedded thing (for example, a vector to search on), and it could also have a fixed size. So the current classification won't work.
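
A rough sketch of the classification discussed in this thread, assuming an ArrayField is indexed by the search type of its element field. The function name, its arguments, and the SEARCH_TYPES table are hypothetical and not the code in this PR. Only the "object" to "document" and EmbeddedModelField to "embeddedDocuments" mappings come from the comments above; the rest is an assumption.

```python
# Hypothetical table from a field's db_type to an Atlas Search field type.
SEARCH_TYPES = {
    "bool": "boolean",
    "date": "date",
    "double": "number",
    "int": "number",
    "long": "number",
    "objectId": "objectId",
    "string": "string",
    "object": "document",  # JSONField, per the comment above
}


def search_type(internal_type, db_type, base=None):
    """Return a search type for a field.

    `base` is an (internal_type, db_type) pair describing the element field
    of an ArrayField; it is ignored for other fields.
    """
    if internal_type in ("EmbeddedModelField", "EmbeddedModelArrayField"):
        return "embeddedDocuments"
    if internal_type == "ArrayField":
        if base is None:
            raise ValueError("ArrayField needs its base field's types.")
        # Assumption: an array is indexed as the type of its elements.
        return search_type(*base)
    try:
        return SEARCH_TYPES[db_type]
    except KeyError:
        raise ValueError(f"No search type for db_type {db_type!r}.") from None


print(search_type("JSONField", "object"))
print(search_type("ArrayField", "array", base=("IntegerField", "int")))
print(search_type("EmbeddedModelField", "embeddedDocuments"))
```

A fixed-size array of floats used as a vector does not fit this table, which is the limitation noted above.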

Collaborator Author

For vector search, I can't add this support as simply as for search indexes. It might need something that separates filter fields from vector fields, which would change the class signature. It isn't very difficult, but I suggest working on it in a later iteration.

id=f"{self._error_id_prefix}.E002",
)
)
if not isinstance(field_.base_field, FloatField | DecimalField):
Collaborator

Is isinstance() definitely what we want here, as opposed to, say, checking db_type()? (That would be double and decimal for those fields.) On the one hand, an error message that references FloatField and DecimalField probably covers the most common use cases and is going to be easier to understand than one referencing double/decimal. Just wondering if you had any thoughts on this.

Is this the relevant documentation for this restriction? https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#about-the-vector-type I see double there, but it's not clear to me how DecimalField qualifies. I'm not sure what the documentation means when it mentions:

  • BSON BinData vector subtype float32
  • BSON BinData vector subtype int1
  • BSON BinData vector subtype int8

Collaborator Author

DecimalField isn't used for that; IntegerField is. But given that not all the similarity functions are supported for int1, shall we support only float32?

Collaborator

I'm not sure; maybe we should ask the team.

)
else:
    field_type = field_.db_type(connection)
    search_type = self.search_index_data_types(field_, field_type)
Collaborator

The linked doc says, "You can filter on boolean, date, objectId, numeric, string, and UUID values, including arrays of these types."
Does this account for arrays?

Collaborator Author

Let me try to shed some light on this behavior. The array can be defined on the left-hand side (the stored field), but not in the filter value. If I try a query like:

db.t.t.aggregate([
  {
    "$vectorSearch": {
      "index": "vector_index",
      "filter": {
        "name": ["example", "example3"]
      },
      "path": "values",
      "queryVector": [0,0,0,0,0],
      "numCandidates": 150,
      "limit": 10,
      "quantization": "scalar"
    }
  }])

I get the following error:
MongoServerError[UnknownError]: PlanExecutor error during aggregation :: caused by :: "filter" must be a boolean, objectId, number, string, date, uuid, or null
So I cannot pass an array as a filter value.

But I can define documents like:

db.t.t.insertOne({
  name: "example",
  values: [1.23, 4.56, 7.89, 0.12,3.45]
})

db.t.t.insertOne({
  name: ["example", "example2"],
  values: [1.23, 4.56, 7.89, 0.12,3.45]
})

Then I filter by:

db.t.t.aggregate([
  {
    "$vectorSearch": {
      "index": "vector_index",
      "filter": {
        "name": "example2"
      },
      "path": "values",
      "queryVector": [0,0,0,0,0],
      "numCandidates": 150,
      "limit": 10,
      "quantization": "scalar"
    }
  }])

I get the second document, and if I filter by "example" I get both.
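
The behavior above can be sketched in plain Python without a server. The helper names are hypothetical, and matches() only emulates equality matching against scalar or array fields; it is not how Atlas evaluates the filter. ObjectId is left out of the allowed types to keep the sketch free of third-party imports.

```python
import datetime
import uuid

# Scalar types the error message above accepts (ObjectId omitted here).
SCALAR_FILTER_TYPES = (bool, int, float, str, datetime.datetime, uuid.UUID, type(None))


def vector_search_stage(index, path, query_vector, filters, num_candidates=150, limit=10):
    """Build a $vectorSearch stage, rejecting non-scalar filter values."""
    for name, value in filters.items():
        if not isinstance(value, SCALAR_FILTER_TYPES):
            raise TypeError(f"Filter value for {name!r} must be a scalar, not {type(value).__name__}.")
    return {
        "$vectorSearch": {
            "index": index,
            "filter": dict(filters),
            "path": path,
            "queryVector": list(query_vector),
            "numCandidates": num_candidates,
            "limit": limit,
        }
    }


def matches(document, filters):
    """Emulate equality matching: a scalar matches an equal scalar or an array containing it."""
    for name, wanted in filters.items():
        stored = document.get(name)
        if isinstance(stored, list):
            if wanted not in stored:
                return False
        elif stored != wanted:
            return False
    return True


docs = [
    {"name": "example", "values": [1.23, 4.56, 7.89, 0.12, 3.45]},
    {"name": ["example", "example2"], "values": [1.23, 4.56, 7.89, 0.12, 3.45]},
]
print([matches(d, {"name": "example2"}) for d in docs])  # [False, True]
print([matches(d, {"name": "example"}) for d in docs])   # [True, True]
stage = vector_search_stage("vector_index", "values", [0, 0, 0, 0, 0], {"name": "example2"})
```

So the array belongs in the stored document, while each filter value stays a scalar.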

Collaborator Author

But indeed, we could support arrays of those types.

@timgraham timgraham force-pushed the create-atlas-indexes branch from 21c9047 to 0d6f719 Compare April 24, 2025 18:28
Comment on lines 21 to 56
name: Django Test Suite
runs-on: ubuntu-latest
steps:
  - name: Checkout django-mongodb-backend
    uses: actions/checkout@v4
    with:
      persist-credentials: false
  - name: install django-mongodb-backend
    run: |
      pip3 install --upgrade pip
      pip3 install -e .
  - name: Checkout Django
    uses: actions/checkout@v4
    with:
      repository: 'mongodb-forks/django'
      ref: 'mongodb-5.1.x'
      path: 'django_repo'
      persist-credentials: false
  - name: Install system packages for Django's Python test dependencies
    run: |
      sudo apt-get update
      sudo apt-get install libmemcached-dev
  - name: Install Django and its Python test dependencies
    run: |
      cd django_repo/tests/
      pip3 install -e ..
      pip3 install -r requirements/py3.txt
  - name: Copy the test settings file
    run: cp .github/workflows/mongodb_settings.py django_repo/tests/
  - name: Copy the test runner file
    run: cp .github/workflows/runtests.py django_repo/tests/runtests_.py
  - name: Start local Atlas
    working-directory: .
    run: bash .github/workflows/start_local_atlas.sh mongodb/mongodb-atlas-local:7
  - name: Run tests
    run: python3 django_repo/tests/runtests_.py

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium test

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
timgraham

This comment was marked as resolved.

@timgraham timgraham force-pushed the create-atlas-indexes branch from fd52b43 to c59297c Compare April 26, 2025 01:07
@timgraham timgraham force-pushed the create-atlas-indexes branch from 455f98e to 05766b4 Compare May 3, 2025 21:26
@timgraham timgraham force-pushed the create-atlas-indexes branch from 05766b4 to 8804c87 Compare May 4, 2025 23:59
@timgraham timgraham merged commit 8804c87 into mongodb:main May 5, 2025
15 checks passed
5 participants