
Conversation

@cdjk
Contributor

@cdjk cdjk commented Nov 27, 2025

Proposed change

Previously, bulk-editing a custom field issued one SQL UPDATE per document inside a for loop. This was slow, on the order of seconds per document.

The change uses separate bulk_create and bulk_update operations in order to support MySQL/MariaDB; Django only supports bulk_create(update_conflicts=True) on PostgreSQL and SQLite.
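
For context, this is roughly the shape of the change (a rough sketch, not the exact diff in this PR; the CustomFieldInstance layout and the value_text field are assumptions for illustration):

```python
# Rough sketch only - not the actual code in this PR. The model/field names
# (CustomFieldInstance, value_text) are assumptions for illustration.
from documents.models import CustomFieldInstance


def set_custom_field_bulk(documents, field, value):
    # One query to find which documents already have an instance of this field.
    existing = {
        inst.document_id: inst
        for inst in CustomFieldInstance.objects.filter(
            document__in=documents,
            field=field,
        )
    }
    to_create, to_update = [], []
    for doc in documents:
        inst = existing.get(doc.pk)
        if inst is None:
            to_create.append(
                CustomFieldInstance(document=doc, field=field, value_text=value),
            )
        else:
            inst.value_text = value
            to_update.append(inst)
    # Two statements total instead of one statement per document.
    # bulk_create(update_conflicts=True) could merge them, but per the note
    # above that path is limited to PostgreSQL/SQLite, so create and update
    # stay separate here.
    CustomFieldInstance.objects.bulk_create(to_create)
    if to_update:
        CustomFieldInstance.objects.bulk_update(to_update, ["value_text"])
```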

(all times in seconds)

                           time - before change   time - after change
sqlite - 10 documents      26                     0.14
sqlite - 100 documents     124                    0.15
postgres - 10 documents    59                     0.15
postgres - 100 documents   199                    0.21
mariadb - 10 documents     19                     0.14
mariadb - 100 documents    209                    0.16

I used AI to write:

  • the performance testing code
  • some of the doclink reflection bulk updates.

Type of change

  • Bug fix: non-breaking change which fixes an issue.
  • New feature / Enhancement: non-breaking change which adds functionality. Please read the important note above.
  • Breaking change: fix or feature that would cause existing functionality to not work as expected.
  • Documentation only.
  • Other. Please explain: performance improvement - no functional change

Checklist:

  • I have read & agree with the contributing guidelines.
  • If applicable, I have included testing coverage for new code in this PR, for backend and / or front-end changes.
  • If applicable, I have tested my code for breaking changes & regressions on both mobile & desktop devices, using the latest version of major browsers.
  • If applicable, I have checked that all tests pass, see documentation.
  • I have run all pre-commit hooks, see documentation.
  • I have made corresponding changes to the documentation as needed.
  • In the description of the PR above I have disclosed the use of AI tools in the coding of this PR.

@github-actions github-actions bot added backend non-trivial Requires approval by several team members enhancement New feature or enhancement labels Nov 27, 2025
@github-actions
Contributor

Hello @cdjk,

Thank you very much for submitting this PR to us!

This is what will happen next:

  1. CI tests will run against your PR to ensure quality and consistency.
  2. Next, human contributors from paperless-ngx review your changes.
  3. Please address any issues that come up during the review as soon as you are able to.
  4. If accepted, your pull request will be merged into the dev branch and changes there will be tested further.
  5. Eventually, changes from you and other contributors will be merged into main and a new release will be made.

You'll be hearing from us soon, and thank you again for contributing to our project.

@codecov

codecov bot commented Nov 27, 2025

❌ 1 Tests Failed:

Tests completed   Failed   Passed   Skipped
2600              1        2599     5
View the top 1 failed test(s) by shortest run time
src.documents.tests.test_api_bulk_edit.TestBulkEditAPI::test_bulk_edit_audit_log_enabled_custom_fields
Stack Traces | 0.093s run time
self = <documents.tests.test_api_bulk_edit.TestBulkEditAPI testMethod=test_bulk_edit_audit_log_enabled_custom_fields>

    @override_settings(AUDIT_LOG_ENABLED=True)
    def test_bulk_edit_audit_log_enabled_custom_fields(self):
        """
        GIVEN:
            - Audit log is enabled
        WHEN:
            - API to bulk edit custom fields is called
        THEN:
            - Audit log is created
        """
        LogEntry.objects.all().delete()
        response = self.client.post(
            ".../api/documents/bulk_edit/",
            json.dumps(
                {
                    "documents": [self.doc1.id],
                    "method": "modify_custom_fields",
                    "parameters": {
                        "add_custom_fields": [self.cf1.id],
                        "remove_custom_fields": [],
                    },
                },
            ),
            content_type="application/json",
        )
    
        self.assertEqual(response.status_code, status.HTTP_200_OK)
>       self.assertEqual(LogEntry.objects.filter(object_pk=self.doc1.id).count(), 2)
E       AssertionError: 1 != 2

.../documents/tests/test_api_bulk_edit.py:1667: AssertionError


@shamoon
Member

shamoon commented Nov 27, 2025

Gotta be honest, the complexity (and size) makes me pretty nervous about unintended consequences. It also looks like the AI-generated part of this is in fact the largest part?

Obviously I haven’t looked that closely but in general I wonder if a more surgical approach might be less risky.

In the meantime can I ask how exactly you generated those tests? The initial numbers seem atypical.

@cdjk
Contributor Author

cdjk commented Nov 27, 2025

It's real data - I used the document exporter on my paperless install, which has about 2.5k files, then the document importer to load it. This matches the performance I noticed, which is why I wanted to fix it.

I'm running on a t3.large. The performance is about the same with postgres on RDS or in docker. I tried it on my mac (m4 pro) with just postgres, and got:

number of documents   time to bulk edit a field
10                    5
20                    9
50                    20
100                   40
200                   68

There's a shell script that starts the container using the official docker compose files (with an override to specify my locally built image), and a manage.py command that bulk edits a field. Happy to share those, or check them in if there's a good spot. I wrote the shell script, AI wrote the manage.py command (but it's simple).
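
Roughly, the command does something like this (a minimal sketch, not the actual script; the command name, the field name, the FieldDataType used, and the modify_custom_fields call signature are all assumptions):

```python
# documents/management/commands/time_bulk_edit.py (hypothetical path/name).
# Minimal sketch of the timing command described above, not the actual script.
# The bulk_edit.modify_custom_fields signature used here is an assumption.
import time

from django.core.management.base import BaseCommand

from documents import bulk_edit
from documents.models import CustomField, Document


class Command(BaseCommand):
    help = "Time a bulk custom-field edit over the first N documents."

    def add_arguments(self, parser):
        parser.add_argument("--count", type=int, default=100)

    def handle(self, *args, **options):
        field, _ = CustomField.objects.get_or_create(
            name="timing-test",
            defaults={"data_type": CustomField.FieldDataType.STRING},
        )
        doc_ids = list(
            Document.objects.values_list("pk", flat=True)[: options["count"]],
        )
        start = time.monotonic()
        bulk_edit.modify_custom_fields(
            doc_ids,
            add_custom_fields={field.pk: "value"},
            remove_custom_fields=[],
        )
        elapsed = time.monotonic() - start
        self.stdout.write(f"{len(doc_ids)} documents: {elapsed:.2f}s")
```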

I'm happy to revert the doclink reflection code. Editing 100 or so custom fields at a time with document links seems contrived, but bulk-setting other custom fields, especially when loading up a bunch of documents, feels fairly common.

If there's something obvious I'm missing in my setup that would speed things up I'm happy to do that instead.

@shamoon
Member

shamoon commented Nov 27, 2025

Cool thanks. It’s interesting, not my experience but I’ll have to play with it.

But I am sure there is lots of room to improve and we’d be happy to realize those gains. I suppose I’d be curious whether the first part (not the reflect stuff) is enough to see a significant reduction. I do wonder if that might strike a better balance of performance vs “risk”. We’ve come to accept paperless is just kinda complicated; sometimes even small changes have surprising consequences…

@shamoon shamoon changed the title from "Optimize bulk custom field operations for performance" to "Enhancement: optimize bulk custom field operations for performance" Nov 27, 2025
- Replace per-document update_or_create loop with bulk operations
  (bulk_create + bulk_update) for better performance. Can't use
  bulk_create(update_conflicts=True) because that only works on
  PostgreSQL/SQLite.
- Do not optimize doclinks for now - too complex

- Add warning log when attempting to add a non-existent custom field.
@cdjk cdjk force-pushed the speed-up-bulk-custom-fields branch from b2c5b40 to 025562e on November 28, 2025 02:25

@shamoon
Member

shamoon commented Nov 28, 2025

Thanks, does seem much cleaner already

@cdjk
Contributor Author

cdjk commented Nov 30, 2025

We’ve come to accept paperless is just kinda complicated

You are absolutely right. My fix bypassed the real problem, which is in how storage paths and filename formats interact.

The performance issue only happens when setting a custom field on a document that has a storage path set. In that case, the post_save signal handler that runs when a Document is changed calls generate_unique_filename, which then calls generate_filename in a while loop with a counter to find a filename that doesn't exist yet.

This generates filenames of the format storage_path-N, with N starting at 0 for each document, and keeps incrementing N until it finds a unique name. Each of these calls runs the Jinja2 templating logic. In my data, file_handling.py:generate_filename gets called about 30k times. I should have used the profiler earlier.
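
In simplified form, the loop I'm describing behaves roughly like this (an illustration of the behaviour, not the actual paperless-ngx code):

```python
# Simplified illustration of the retry loop described above, not the real code.
# Every retry re-renders the storage path template, so saving N documents that
# collide on the same rendered name costs roughly 1 + 2 + ... + N render calls.
def generate_unique_filename(doc, existing_filenames, render_storage_path):
    counter = 0
    while True:
        # The Jinja2 templating runs again on every iteration.
        candidate = f"{render_storage_path(doc)}-{counter}"
        if candidate not in existing_filenames:
            return candidate
        counter += 1
```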

My attempted fix with bulk_create and bulk_update hid the underlying issue, because Django bulk creates and updates bypass those signals by not calling .save() on each model.
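
For illustration, the signal behaviour looks like this (documented Django behaviour; the receiver here is just an example, not the actual handler):

```python
# Django's bulk_create()/bulk_update() skip Model.save(), so pre_save/post_save
# receivers (like the one that triggers filename generation) never run.
from django.db.models.signals import post_save
from django.dispatch import receiver

from documents.models import Document


@receiver(post_save, sender=Document)
def on_document_saved(sender, instance, **kwargs):
    print("post_save fired for", instance.pk)


doc = Document.objects.first()
doc.title = "renamed"
doc.save()                                       # fires the receiver above
Document.objects.bulk_update([doc], ["title"])   # does not fire post_save
```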

I'm going to stop using storage paths for now. Some quick tests confirm that makes everything a lot faster. I will take a look at fixing the underlying issue because it bothers me, but it might take some time. And even then, I'm not sure the fix in this PR is correct, because it bypasses the post_save signal. On the other hand, there are other bulk operations in bulk_edit.py that bypass those signals, so I'm not sure if it matters. I'm open to thoughts on what the correct solution is.
