feat(server): library refresh go brrr #14456
base: main
Conversation
Nice start! I think there's still a lot of untapped potential for improvement here.
The update to …
Thanks for your comments @mertalev! I'll first attempt to do the import path and exclusion pattern checks in SQL, and then move on to your suggestions.
Never thought of that; I've implemented your suggestion. I'm also considering changing the initial import code to ignore file mtime, which would let us skip all file system calls except for the crawl itself. Metadata extraction will have to do the heavy lifting instead.
Would that mean you queue them for metadata extraction even if they're unchanged? You can test it, but I think it'd be more overhead than the stat calls. Edit: also, if you do this with the source set to …
I was referring to new imports: files that are new to Immich. I hoped to improve ingest performance by removing the stat call. After testing, there are two issues:
If we can mitigate the two issues above, I can rewrite the library import feature and do that in batches as well!
I don't see why `fileModifiedAt` needs a non-null constraint in the DB. It might just be an oversight that didn't matter because it didn't affect our usage. I think you can change the asset entity and generate a migration to remove that constraint. For sidecar files, maybe you could add …
I might just put `new Date()` in for the moment to keep the PR somewhat constrained. Regarding sidecars, I have thought about that; the problem right now is that we're batching the crawled files in batches of 10k, and it might be hard to get that working well. Maybe I'll just queue a sidecar discovery job for every imported asset for now.
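The per-asset fallback described above could look like the sketch below. This is a hedged illustration only: `queueAll` and the job payload shape are hypothetical stand-ins, not Immich's actual job API.

```typescript
// Hypothetical job payload; Immich's real job types differ.
interface SidecarJob {
  name: 'sidecar-discovery';
  assetId: string;
}

// After each imported batch, fan out one discovery job per asset.
// queueAll is a stand-in for whatever bulk-enqueue method the queue exposes.
async function queueSidecarDiscovery(
  assetIds: string[],
  queueAll: (jobs: SidecarJob[]) => Promise<void>,
): Promise<number> {
  const jobs = assetIds.map((assetId) => ({ name: 'sidecar-discovery' as const, assetId }));
  await queueAll(jobs);
  return jobs.length;
}
```

Queuing per asset keeps each job trivially small; the trade-off versus batch-level discovery is queue overhead, which is why batching it alongside the 10k crawl chunks was discussed.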
```typescript
.limit(pagination.take + 1)
.offset(pagination.skip ?? 0);
```
@etnoy didn't you want to use a stream?
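For context on the stream suggestion: a limit/offset loop can be wrapped in an async generator so callers consume rows one at a time. This is a plain-TypeScript sketch, not the PR's code; `fetchPage` stands in for the actual Kysely query.

```typescript
// Sketch: expose offset pagination as an async stream of rows.
// fetchPage is a stand-in for the real .limit()/.offset() query.
type FetchPage<T> = (limit: number, offset: number) => Promise<T[]>;

async function* streamRows<T>(fetchPage: FetchPage<T>, pageSize = 1000): AsyncGenerator<T> {
  let offset = 0;
  for (;;) {
    // Ask for one extra row to learn whether another page exists,
    // mirroring the .limit(pagination.take + 1) trick in the diff.
    const rows = await fetchPage(pageSize + 1, offset);
    const hasMore = rows.length > pageSize;
    for (const row of rows.slice(0, pageSize)) {
      yield row;
    }
    if (!hasMore) {
      return;
    }
    offset += pageSize;
  }
}
```

With the real query builder, Kysely's `stream()` (backed by a database cursor on supported drivers) avoids re-planning the query for every page, which is presumably why a stream was suggested here.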
```typescript
async getLibraryAssetCount(options: AssetSearchOptions = {}): Promise<number | undefined> {
  const { count } = await this.db
    .selectFrom('assets')
    .select(sql`COUNT(*)`.as('count'))
```
Still an open comment.
```typescript
const assetIds: string[] = [];

for (let i = 0; i < assetImports.length; i += 5000) {
  // Chunk the imports to avoid the postgres limit of max parameters at once
```
Is this still a problem or does kysely already handle this?
I think the same limitation applies to something like `this.db.insertInto('assets').values(assets)`. But it does let us pass the entire array as a single parameter (meaning the 65,535 limit doesn't apply) if it's written differently.
@danieldietzler when I first migrated this PR to kysely I had to batch it here to avoid errors
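As a hedged illustration of the constraint being discussed: Postgres caps a single prepared statement at 65,535 bind parameters, so a multi-row `VALUES` insert has to be chunked so that rows-per-batch times columns-per-row stays under the cap. The helper below is a generic sketch, not the PR's code.

```typescript
// Postgres allows at most 65,535 bind parameters per prepared statement.
const PG_MAX_PARAMETERS = 65_535;

// Split rows so that batch.length * columnsPerRow never exceeds the cap.
function chunkForInsert<T>(rows: T[], columnsPerRow: number): T[][] {
  const rowsPerBatch = Math.max(1, Math.floor(PG_MAX_PARAMETERS / columnsPerRow));
  const batches: T[][] = [];
  for (let i = 0; i < rows.length; i += rowsPerBatch) {
    batches.push(rows.slice(i, i + rowsPerBatch));
  }
  return batches;
}
```

Each batch would then go through the insert builder separately. The fixed 5,000-row chunk in the diff is a conservative bound: it stays under the limit for any row with up to 13 bound columns.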
```typescript
  return JobStatus.SKIPPED;
}

@OnJob({ name: JobName.LIBRARY_SYNC_ASSETS, queue: QueueName.LIBRARY })
async handleSyncAssets(job: JobOf<JobName.LIBRARY_SYNC_ASSETS>): Promise<JobStatus> {
```
In this function (and kind of in general) there is a lot of logic (or at least lines of code I need to read) that is purely added complexity for the sake of logs. I get that logs are neat, but I'd personally argue that some of them don't add any value at all. IMO logs should primarily indicate and help understand an error.
Trust me, if you scan or rescan a library with >1M assets you really need these logs
Can we at least reduce them or have more generic logs instead of dedicated if/else structures and variables just for logging?
Done. Let me know if you think this is better
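One common way to keep progress visibility on huge scans without scattering log-only variables through the business logic is to throttle logging by count. A minimal sketch, assuming nothing about the PR's final logging code:

```typescript
// Log progress at most once per `every` processed items, so a >1M-asset
// scan produces hundreds of log lines instead of millions.
function makeProgressLogger(every: number, log: (msg: string) => void) {
  let processed = 0;
  return (batchSize: number): number => {
    const before = processed;
    processed += batchSize;
    // Fire only when the counter crosses a multiple of `every`.
    if (Math.floor(processed / every) > Math.floor(before / every)) {
      log(`Processed ${processed} assets`);
    }
    return processed;
  };
}
```

The scan loop would call the returned function once per batch; all the counting and if/else lives in one place instead of inline with the sync logic.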
Your call on merging this one @mertalev
This PR significantly improves library scanning performance. Wherever suitable, jobs are done in batches, and many looped database interactions are replaced with single SQL queries.
User testimonials
"@etnoy what on earth have you done. I tried your PR and it finished the scan for 1M assets in 37 seconds down from 728s on main. It takes 188s just to finish queuing on main" -- @mertalev
Changes made
Plus several minor cleanups and performance enhancements.
The performance improvements are at least an order of magnitude in library scanning.
Benchmark 1
A library scan of 22k items where nothing has changed since the last scan used to take 1m 22s; now it's below 10 seconds, an improvement of roughly 87 percent!
Benchmark 2
A clean library import of 19k items takes 1m 40s on main and 7 seconds in this PR.
NOTE: this benchmark covers only the library service scan and does not include metadata extraction. Also, some fs calls have been moved from the library service to the metadata service, although this should have only a minor impact on overall scan performance.
Benchmark 3
Importing a library with >5M assets.
No need to compare to main, you know it's fast!
Benchmark 4
Importing a library of 527,041 files took 1m 58s (without metadata extraction) in this PR.
No need to compare to main, you know it's fast!
Bonus:
This scan imports all new files:
This is an "idle scan", where a refresh finds no changes:

Future work:
Final note:
This PR allowed me to hit a milestone of 10M assets in a single Immich instance, likely a world first. This does require `max-old-space-size=8096`, but that's to be expected anyway.