Enhancement: optimize bulk custom field operations for performance #11482
base: dev
Conversation
Hello @cdjk, Thank you very much for submitting this PR to us! This is what will happen next:
You'll be hearing from us soon, and thank you again for contributing to our project.
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard
Gotta be honest, the complexity (and size) makes me pretty nervous about unintended consequences. It also looks like the AI-generated part of this is in fact the largest part? Obviously I haven't looked that closely, but in general I wonder if a more surgical approach might be less risky. In the meantime, can I ask how exactly you generated those tests? The initial numbers seem atypical.
It's real data - I used the document exporter on my paperless install, which has about 2.5k files, then the document importer to load it. This matches the performance I noticed, which is why I wanted to fix it. I'm running on a t3.large. The performance is about the same with postgres on RDS or in docker. I tried it on my mac (m4 pro) with just postgres, and got:
There's a shell script that starts the container using the official docker compose files (with an override to specify my locally built image), and a manage.py command that bulk edits a field. Happy to share those, or check them in if there's a good spot. I wrote the shell script, AI wrote the manage.py command (but it's simple). I'm happy to revert the doclink reflection code. Editing 100 or so custom fields at a time with document links seems contrived, but bulk setting other custom fields, especially when loading up a bunch of things, feels fairly common. If there's something obvious I'm missing in my setup that would speed things up I'm happy to do that instead.
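For reference, a minimal sketch of what such a manage.py benchmarking command could look like. The helper `documents.bulk_edit.modify_custom_fields` and its keyword arguments are assumptions here, not necessarily the real signature in the paperless codebase:

```python
# Rough benchmark command sketch -- not the command used for the numbers above.
import time

from django.core.management.base import BaseCommand

from documents import bulk_edit
from documents.models import CustomField, Document


class Command(BaseCommand):
    help = "Time a bulk custom-field edit across every document."

    def add_arguments(self, parser):
        parser.add_argument("field_name")
        parser.add_argument("value")

    def handle(self, *args, **options):
        field = CustomField.objects.get(name=options["field_name"])
        doc_ids = list(Document.objects.values_list("pk", flat=True))

        start = time.monotonic()
        # Assumed helper: the PR touches the bulk custom-field path in
        # documents/bulk_edit.py; adjust the call to the actual signature.
        bulk_edit.modify_custom_fields(
            doc_ids,
            add_custom_fields={field.pk: options["value"]},
            remove_custom_fields=[],
        )
        self.stdout.write(
            f"Edited {len(doc_ids)} documents in {time.monotonic() - start:.2f}s",
        )
```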
Cool, thanks. It's interesting - not my experience, but I'll have to play with it. I am sure there is lots of room to improve, and we'd be happy to realize those gains. I suppose I'd be curious whether the first part (not the reflect stuff) is enough to see the significant reduction. I do wonder if that might strike a better balance of performance vs "risk". We've come to accept that paperless is just kinda complicated; sometimes even small changes have surprising consequences…
- Replace per-document update_or_create loop with bulk operations (bulk_create + bulk_update) for better performance. Can't use bulk_create(update_conflicts=True) because that only works on PostgreSQL/SQLite.
- Do not optimize doclinks for now - too complex.
- Add warning log when attempting to add a non-existent custom field.
Force-pushed from b2c5b40 to 025562e
Thanks, does seem much cleaner already
You are absolutely right. My fix bypassed the real problem, which is in how storage paths and filename formats interact.

The performance issue only happens when setting a custom field while a storage path is set. When a storage path is set, the post_save signal that fires when a Document changes calls generate_unique_filename, which then calls generate_filename in a while loop with a counter to find a filename that doesn't exist yet. This generates filenames of the form storage_path-N, with N starting at 0 for each document, and keeps incrementing N until it finds a unique name. Each of these calls runs the jinja2 templating logic. In my data, file_handling.py:generate_filename gets called about 30k times. I should have used the profiler earlier.

My attempted fix with bulk_create/bulk_update hid the underlying issue, because django bulk updates and creates bypass those signals by not calling .save() on each model.

I'm going to stop using storage paths for now. Some quick tests confirm that makes everything a lot faster. I will take a look at fixing the underlying issue because it bothers me, but it might take some time. And even then, I'm not sure the fix in this PR is correct, because it bypasses the post_save signal. On the other hand, there are other bulk operations in bulk_edit.py that bypass those signals, so I'm not sure if it matters. I'm open to thoughts on what the correct solution is.
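As a self-contained illustration of that loop - not the actual code in documents/file_handling.py, just the shape of the problem:

```python
def render_filename(base: str, counter: int) -> str:
    # Stand-in for the jinja2-based generate_filename(); rendering the
    # storage-path / filename-format template is the expensive step.
    return f"{base}-{counter}.pdf" if counter else f"{base}.pdf"


def generate_unique_filename(base: str, taken: set[str]) -> tuple[str, int]:
    # The counter restarts at 0 for every document, so each new document
    # re-renders every name that is already taken before finding a free one.
    counter = 0
    while True:
        candidate = render_filename(base, counter)
        if candidate not in taken:
            return candidate, counter + 1  # name plus number of renders it took
        counter += 1


taken: set[str] = set()
renders = 0
for _ in range(250):  # e.g. 250 documents sharing one storage path
    name, cost = generate_unique_filename("storage_path", taken)
    taken.add(name)
    renders += cost

print(renders)  # tens of thousands of template renders for a few hundred docs
```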



Proposed change
Previously, bulk-editing a custom field resulted in an SQL update being issued in a for loop, one per document. This was slow - on the order of seconds per document.
This needs separate bulk_create + bulk_update operations in order to support MySQL/MariaDB - Django only supports bulk_create(update_conflicts=True) on PostgreSQL and SQLite.
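For illustration, a rough sketch of that pattern; the CustomFieldInstance model and its value_text column are assumed names for how paperless stores custom field values and may not match the real schema:

```python
from documents.models import CustomFieldInstance


def bulk_set_custom_field(doc_ids, field, value):
    # One query to find the instances that already exist for these documents.
    existing = {
        inst.document_id: inst
        for inst in CustomFieldInstance.objects.filter(
            document_id__in=doc_ids,
            field=field,
        )
    }

    to_update, to_create = [], []
    for doc_id in doc_ids:
        if doc_id in existing:
            inst = existing[doc_id]
            inst.value_text = value
            to_update.append(inst)
        else:
            to_create.append(
                CustomFieldInstance(document_id=doc_id, field=field, value_text=value),
            )

    # Two statements instead of one update_or_create per document. A single
    # bulk_create(update_conflicts=True) would be simpler, but as noted above
    # Django only supports that on PostgreSQL/SQLite.
    CustomFieldInstance.objects.bulk_update(to_update, ["value_text"])
    CustomFieldInstance.objects.bulk_create(to_create)
```

As the review discussion notes, bulk_create and bulk_update skip per-instance save() and the post_save signal, which is both part of why they are faster and why they need care around filename handling.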
(all times in seconds)
I used AI to write:
Type of change
Checklist:
pre-commit hooks, see documentation.