
Conversation

@cdjk
Contributor

@cdjk cdjk commented Nov 27, 2025

Proposed change

Previously, bulk-editing a custom field issued one SQL UPDATE per document inside a for loop. This was slow, on the order of seconds per document.

The change uses separate bulk_create and bulk_update operations in order to support MySQL/MariaDB; Django only supports bulk_create(update_conflicts=True) on PostgreSQL and SQLite.
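
For context, this is roughly the shape of the change (a rough sketch, not the exact diff in this PR; the CustomFieldInstance layout and the value_text field are assumptions for illustration):

```python
# Rough sketch only - not the actual code in this PR. The model/field names
# (CustomFieldInstance, value_text) are assumptions for illustration.
from documents.models import CustomFieldInstance


def set_custom_field_bulk(documents, field, value):
    # One query to find which documents already have an instance of this field.
    existing = {
        inst.document_id: inst
        for inst in CustomFieldInstance.objects.filter(
            document__in=documents,
            field=field,
        )
    }
    to_create, to_update = [], []
    for doc in documents:
        inst = existing.get(doc.pk)
        if inst is None:
            to_create.append(
                CustomFieldInstance(document=doc, field=field, value_text=value),
            )
        else:
            inst.value_text = value
            to_update.append(inst)
    # Two statements total instead of one statement per document.
    # bulk_create(update_conflicts=True) could merge them, but per the note
    # above that path is limited to PostgreSQL/SQLite, so create and update
    # stay separate here.
    CustomFieldInstance.objects.bulk_create(to_create)
    if to_update:
        CustomFieldInstance.objects.bulk_update(to_update, ["value_text"])
```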

(all times in seconds)

                           time - before change   time - after change
sqlite - 10 documents      26                     0.14
sqlite - 100 documents     124                    0.15
postgres - 10 documents    59                     0.15
postgres - 100 documents   199                    0.21
mariadb - 10 documents     19                     0.14
mariadb - 100 documents    209                    0.16

I used AI to write:

  • the performance testing code
  • some of the doclink reflection bulk updates.

Type of change

  • Bug fix: non-breaking change which fixes an issue.
  • New feature / Enhancement: non-breaking change which adds functionality. Please read the important note above.
  • Breaking change: fix or feature that would cause existing functionality to not work as expected.
  • Documentation only.
  • Other. Please explain: performance improvement - no functional change

Checklist:

  • I have read & agree with the contributing guidelines.
  • If applicable, I have included testing coverage for new code in this PR, for backend and / or front-end changes.
  • If applicable, I have tested my code for breaking changes & regressions on both mobile & desktop devices, using the latest version of major browsers.
  • If applicable, I have checked that all tests pass, see documentation.
  • I have run all pre-commit hooks, see documentation.
  • I have made corresponding changes to the documentation as needed.
  • In the description of the PR above I have disclosed the use of AI tools in the coding of this PR.

@github-actions github-actions bot added backend non-trivial Requires approval by several team members enhancement New feature or enhancement labels Nov 27, 2025
@github-actions
Contributor

Hello @cdjk,

Thank you very much for submitting this PR to us!

This is what will happen next:

  1. CI tests will run against your PR to ensure quality and consistency.
  2. Next, human contributors from paperless-ngx review your changes.
  3. Please address any issues that come up during the review as soon as you are able to.
  4. If accepted, your pull request will be merged into the dev branch and changes there will be tested further.
  5. Eventually, changes from you and other contributors will be merged into main and a new release will be made.

You'll be hearing from us soon, and thank you again for contributing to our project.

@codecov

codecov bot commented Nov 27, 2025

❌ 1 Tests Failed:

Tests completed   Failed   Passed   Skipped
2600              1        2599     5
View the top 1 failed test(s) by shortest run time
src.documents.tests.test_api_bulk_edit.TestBulkEditAPI::test_bulk_edit_audit_log_enabled_custom_fields
Stack Traces | 0.093s run time
self = <documents.tests.test_api_bulk_edit.TestBulkEditAPI testMethod=test_bulk_edit_audit_log_enabled_custom_fields>

    @override_settings(AUDIT_LOG_ENABLED=True)
    def test_bulk_edit_audit_log_enabled_custom_fields(self):
        """
        GIVEN:
            - Audit log is enabled
        WHEN:
            - API to bulk edit custom fields is called
        THEN:
            - Audit log is created
        """
        LogEntry.objects.all().delete()
        response = self.client.post(
            ".../api/documents/bulk_edit/",
            json.dumps(
                {
                    "documents": [self.doc1.id],
                    "method": "modify_custom_fields",
                    "parameters": {
                        "add_custom_fields": [self.cf1.id],
                        "remove_custom_fields": [],
                    },
                },
            ),
            content_type="application/json",
        )
    
        self.assertEqual(response.status_code, status.HTTP_200_OK)
>       self.assertEqual(LogEntry.objects.filter(object_pk=self.doc1.id).count(), 2)
E       AssertionError: 1 != 2

.../documents/tests/test_api_bulk_edit.py:1667: AssertionError


@shamoon
Member

shamoon commented Nov 27, 2025

Gotta be honest, the complexity (and size) makes me pretty nervous about unintended consequences. It also looks like the AI-generated part of this is in fact the largest part?

Obviously I haven’t looked that closely but in general I wonder if a more surgical approach might be less risky.

In the meantime can I ask how exactly you generated those tests? The initial numbers seem atypical.

@cdjk
Contributor Author

cdjk commented Nov 27, 2025

It's real data - I used the document exporter on my paperless install, which has about 2.5k files, then the document importer to load it. This matches the performance I noticed, which is why I wanted to fix it.

I'm running on a t3.large. The performance is about the same with postgres on RDS or in docker. I tried it on my mac (m4 pro) with just postgres, and got:

number of documents   time to bulk edit a field
10                    5
20                    9
50                    20
100                   40
200                   68

There's a shell script that starts the container using the official docker compose files (with an override to specify my locally built image), and a manage.py command that bulk edits a field. Happy to share those, or check them in if there's a good spot. I wrote the shell script, AI wrote the manage.py command (but it's simple).
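
Roughly, the command does something like this (a minimal sketch, not the actual script; the command name, the field name, the FieldDataType used, and the modify_custom_fields call signature are all assumptions):

```python
# documents/management/commands/time_bulk_edit.py (hypothetical path/name).
# Minimal sketch of the timing command described above, not the actual script.
# The bulk_edit.modify_custom_fields signature used here is an assumption.
import time

from django.core.management.base import BaseCommand

from documents import bulk_edit
from documents.models import CustomField, Document


class Command(BaseCommand):
    help = "Time a bulk custom-field edit over the first N documents."

    def add_arguments(self, parser):
        parser.add_argument("--count", type=int, default=100)

    def handle(self, *args, **options):
        field, _ = CustomField.objects.get_or_create(
            name="timing-test",
            defaults={"data_type": CustomField.FieldDataType.STRING},
        )
        doc_ids = list(
            Document.objects.values_list("pk", flat=True)[: options["count"]],
        )
        start = time.monotonic()
        bulk_edit.modify_custom_fields(
            doc_ids,
            add_custom_fields={field.pk: "value"},
            remove_custom_fields=[],
        )
        elapsed = time.monotonic() - start
        self.stdout.write(f"{len(doc_ids)} documents: {elapsed:.2f}s")
```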

I'm happy to revert the doclink reflection code. Editing 100 or so custom fields at a time with document links seems contrived, but bulk-setting other custom fields, especially when loading up a bunch of documents, feels fairly common.

If there's something obvious I'm missing in my setup that would speed things up I'm happy to do that instead.

@shamoon
Member

shamoon commented Nov 27, 2025

Cool thanks. It’s interesting, not my experience but I’ll have to play with it.

But I am sure there is lots of room to improve and we’d be happy to realize those gains. I suppose I’d be curious whether the first part (not the reflect stuff) is enough to see a significant reduction. I do wonder if that might strike a better balance of performance vs “risk”. We’ve come to accept paperless is just kinda complicated; sometimes even small changes have surprising consequences…

@shamoon shamoon changed the title from "Optimize bulk custom field operations for performance" to "Enhancement: optimize bulk custom field operations for performance" Nov 27, 2025
- Replace per-document update_or_create loop with bulk operations
  (bulk_create + bulk_update) for better performance. Can't use
  bulk_create(update_conflicts=True) because that only works on
  PostgreSQL/SQLite.
- Do not optimize doclinks for now - too complex

- Add warning log when attempting to add a non-existent custom field.
@cdjk cdjk force-pushed the speed-up-bulk-custom-fields branch from b2c5b40 to 025562e on November 28, 2025 02:25

@shamoon
Member

shamoon commented Nov 28, 2025

Thanks, does seem much cleaner already

@cdjk
Contributor Author

cdjk commented Nov 30, 2025

We’ve come to accept paperless is just kinda complicated

You are absolutely right. My fix bypassed the real problem, which is in how storage paths and filename formats interact.

The performance issue only happens when setting a custom field on a document that has a storage path set. In that case, the post_save signal handler that runs when a Document is changed calls generate_unique_filename, which then calls generate_filename in a while loop with a counter to find a filename that doesn't exist yet.

This generates filenames of the format storage_path-N, with N starting at 0 for each document, and keeps incrementing N until it finds a unique name. Each of these calls runs the Jinja2 templating logic. In my data, file_handling.py:generate_filename gets called about 30k times. I should have used the profiler earlier.
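
In simplified form, the loop I'm describing behaves roughly like this (an illustration of the behaviour, not the actual paperless-ngx code):

```python
# Simplified illustration of the retry loop described above, not the real code.
# Every retry re-renders the storage path template, so saving N documents that
# collide on the same rendered name costs roughly 1 + 2 + ... + N render calls.
def generate_unique_filename(doc, existing_filenames, render_storage_path):
    counter = 0
    while True:
        # The Jinja2 templating runs again on every iteration.
        candidate = f"{render_storage_path(doc)}-{counter}"
        if candidate not in existing_filenames:
            return candidate
        counter += 1
```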

My attempted fix with bulk_create and bulk_update hid the underlying issue, because Django bulk creates and updates bypass those signals by not calling .save() on each model.
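
For illustration, the signal behaviour looks like this (documented Django behaviour; the receiver here is just an example, not the actual handler):

```python
# Django's bulk_create()/bulk_update() skip Model.save(), so pre_save/post_save
# receivers (like the one that triggers filename generation) never run.
from django.db.models.signals import post_save
from django.dispatch import receiver

from documents.models import Document


@receiver(post_save, sender=Document)
def on_document_saved(sender, instance, **kwargs):
    print("post_save fired for", instance.pk)


doc = Document.objects.first()
doc.title = "renamed"
doc.save()                                       # fires the receiver above
Document.objects.bulk_update([doc], ["title"])   # does not fire post_save
```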

I'm going to stop using storage paths for now. Some quick tests confirm that makes everything a lot faster. I will take a look at fixing the underlying issue because it bothers me, but it might take some time. And even then, I'm not sure the fix in this PR is correct, because it bypasses the post_save signal. On the other hand, there are other bulk operations in bulk_edit.py that bypass those signals, so I'm not sure if it matters. I'm open to thoughts on what the correct solution is.
