feat(server): library refresh go brrr #14456

Draft
wants to merge 3 commits into
base: main

Conversation

etnoy
Contributor

@etnoy etnoy commented Dec 2, 2024

For a library, we currently queue one job per asset to check whether it is still online. It is more efficient to queue one job per 10k assets instead, so the checks run as a tight loop inside the library service rather than as 10k tiny jobs.
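A minimal sketch of the batching idea described above, assuming the asset ids are already in memory. The helper names `chunk` and `buildSyncJobs` and the `SyncAssetsJob` shape are hypothetical; only the `ids`, `importPaths`, and `exclusionPatterns` fields mirror what the job handler quoted later in this conversation reads.

```typescript
// Hypothetical payload shape; the field names follow job.ids, job.importPaths
// and job.exclusionPatterns as used in the handler quoted in the review.
interface SyncAssetsJob {
  ids: string[];
  importPaths: string[];
  exclusionPatterns: string[];
}

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Build one job per batch instead of one job per asset.
// The default of 10_000 matches the "one job per 10k assets" figure above.
function buildSyncJobs(
  ids: readonly string[],
  importPaths: string[],
  exclusionPatterns: string[],
  batchSize = 10_000,
): SyncAssetsJob[] {
  return chunk(ids, batchSize).map((batch) => ({
    ids: batch,
    importPaths,
    exclusionPatterns,
  }));
}
```

Each returned object would then be handed to the queue as a single job; how that enqueue call looks in the real service is not shown here.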

Other things changed:

  • Checks against import paths and exclusion patterns are done in a single db call for the whole library
  • Imports are much quicker due to a removed db call
  • More e2e tests for handling when offline files go back online, and one major bug was found in that code (fixed!)

A quick performance test of a rescan on an external library with 22573 assets, run on my 32-core server, gave the following stats:

  • This PR: 22k jobs queued, 53s
  • Main: 44k jobs queued, 1m 22s

This performance improvement is important since libraries rarely change and a library rescan is run quite often.

@etnoy etnoy force-pushed the feat/inline-offline-check branch from 80aa615 to 8ecde3b Compare December 2, 2024 21:46
Contributor

@mertalev mertalev left a comment

Nice start! I think there are still a lot of untapped potential improvements here.

}

private async handleSyncAsset(id: string, importPaths: string[], exclusionPatterns: string[]): Promise<JobStatus> {
  const asset = await this.assetRepository.getById(id);
  if (!asset) {
    return JobStatus.SKIPPED;
  }

  const markOffline = async (explanation: string) => {
Contributor

Just log directly without this function since the offline status will be set at the batch level at the end.

@OnJob({ name: JobName.LIBRARY_SYNC_ASSETS, queue: QueueName.LIBRARY })
async handleSyncAssets(job: JobOf<JobName.LIBRARY_SYNC_ASSETS>): Promise<JobStatus> {
  for (const id of job.ids) {
    await this.handleSyncAsset(id, job.importPaths, job.exclusionPatterns);
Contributor

Fetch all the assets first with getByIds. Group the ones to be marked offline, queued for metadata extraction, etc. as you check them, then do batched async calls at the end as needed. The try/catch for stat should still be scoped to the asset so one error won't torpedo the batch.
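The suggestion above could look roughly like the sketch below. It assumes the batch has already been fetched in one call and only shows the grouping step. `AssetLike`, `StatFn`, `SyncPlan`, and `planSync` are hypothetical names for illustration, and treating "offline in the database but reachable again" as the group to queue for metadata extraction is an assumption, not something the PR confirms.

```typescript
// Hypothetical shapes; the project's real entities carry many more fields.
interface AssetLike {
  id: string;
  originalPath: string;
  isOffline: boolean;
}

// Injected so the sketch stays testable; Node's fs.promises.stat fits this type.
type StatFn = (path: string) => Promise<unknown>;

interface SyncPlan {
  offlineIds: string[]; // online in the database, file no longer reachable
  onlineIds: string[]; // offline in the database, file reachable again (assumed metadata candidates)
}

async function planSync(assets: readonly AssetLike[], statFile: StatFn): Promise<SyncPlan> {
  const plan: SyncPlan = { offlineIds: [], onlineIds: [] };

  for (const asset of assets) {
    let reachable = true;
    try {
      // The try/catch is scoped to a single asset so one failure
      // does not abort the rest of the batch.
      await statFile(asset.originalPath);
    } catch {
      reachable = false;
    }

    if (!reachable && !asset.isOffline) {
      plan.offlineIds.push(asset.id);
    } else if (reachable && asset.isOffline) {
      plan.onlineIds.push(asset.id);
    }
  }

  return plan;
}
```

The caller would then issue one batched update per group and one batched enqueue at the end, instead of a database call per asset.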

@mertalev
Contributor

mertalev commented Dec 4, 2024

The update to fileCreatedAt, fileModifiedAt and originalFileName is unnecessary and can be handled in metadata extraction since this will be queued anyway. This makes the batched update for isOffline and deletedAt simpler since there'll be no values that are unique to each asset.
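To illustrate why dropping the per-asset values helps: once every asset in a group receives the same `isOffline` and `deletedAt`, a single patch object can be applied to the whole id list. The sketch below uses an in-memory array in place of the database; `AssetRow` and `updateAll` are hypothetical names, not the repository's actual API.

```typescript
// Hypothetical row shape limited to the fields discussed above.
interface AssetRow {
  id: string;
  isOffline: boolean;
  deletedAt: Date | null;
}

// In-memory stand-in for a repository "update many" call:
// applies one shared patch to every row whose id is in `ids`
// and returns the number of rows changed.
function updateAll(
  rows: AssetRow[],
  ids: readonly string[],
  patch: Partial<Omit<AssetRow, 'id'>>,
): number {
  const wanted = new Set(ids);
  let changed = 0;
  for (const row of rows) {
    if (wanted.has(row.id)) {
      Object.assign(row, patch);
      changed++;
    }
  }
  return changed;
}

// Example: mark two assets offline with one shared patch.
const rows: AssetRow[] = [
  { id: 'a', isOffline: false, deletedAt: null },
  { id: 'b', isOffline: false, deletedAt: null },
  { id: 'c', isOffline: false, deletedAt: null },
];
const changedCount = updateAll(rows, ['a', 'c'], { isOffline: true, deletedAt: new Date() });
```

If the patch had to carry values such as a per-file modification time, this would degrade into one update per asset, which is the case the comment above argues can be left to metadata extraction.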

@etnoy
Contributor Author

etnoy commented Dec 8, 2024

Thanks for your comments, @mertalev! I'll first attempt to do the import path and exclusion pattern checks in SQL, then move on to your suggestions.

@etnoy etnoy force-pushed the feat/inline-offline-check branch 2 times, most recently from d394654 to 8b2a48c Compare December 9, 2024 21:34
@etnoy etnoy force-pushed the feat/inline-offline-check branch 3 times, most recently from 6d69307 to c26f6aa Compare December 10, 2024 16:41
@etnoy etnoy force-pushed the feat/inline-offline-check branch from c26f6aa to a3be620 Compare December 10, 2024 20:39
@etnoy etnoy changed the title from "feat(server): run all offline checks in a single job" to "feat(server): library refresh go brrr" Dec 10, 2024
.where({ isOffline: false })
.andWhere(
  new Brackets((qb) => {
    qb.where('originalPath NOT SIMILAR TO :paths', {
Contributor

Use LIKE instead of SIMILAR TO.

The exclusions and import paths are also specific to a particular library, right? So you need to specify the library in the query.

Also, can you generate SQL for this and confirm with EXPLAIN ANALYZE that it uses an index?
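One way to act on the first two points is sketched below: each import path becomes a prefix pattern for `NOT LIKE`, with `%`, `_`, and backslash escaped (backslash is PostgreSQL's default `LIKE` escape character), and the query is restricted to one library. The function names, the `"assets"` table, and the `"libraryId"` column are assumptions for illustration; `originalPath` and `isOffline` come from the diff above. This builds plain parameterized SQL rather than using the query builder.

```typescript
// Escape characters that LIKE would otherwise treat as wildcards.
function escapeLike(value: string): string {
  return value.replace(/[\\%_]/g, (ch) => `\\${ch}`);
}

interface BuiltQuery {
  sql: string;
  params: string[];
}

// Select online assets of one library whose path lies outside every import path.
// Table and column names other than originalPath/isOffline are assumptions.
function buildOutsideImportPathsQuery(libraryId: string, importPaths: readonly string[]): BuiltQuery {
  const params: string[] = [libraryId];

  const clauses = importPaths.map((path) => {
    // Add a trailing slash so "/mnt/photos" does not also match "/mnt/photos2".
    const prefix = path.endsWith('/') ? path : `${path}/`;
    params.push(`${escapeLike(prefix)}%`);
    return `AND "originalPath" NOT LIKE $${params.length}`;
  });

  const sql = [
    'SELECT "id" FROM "assets"',
    'WHERE "libraryId" = $1',
    'AND "isOffline" = false',
    ...clauses,
  ].join('\n');

  return { sql, params };
}
```

For the third point, the generated statement can be prefixed with `EXPLAIN ANALYZE` and run in `psql` with real parameter values to see which indexes the planner picks; whether it uses one depends on the schema and data, so this sketch makes no claim about the result.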

.update()
.set({
  isOffline: true,
  deletedAt: new Date(),
Contributor

The status also needs to be set. This is why I don't really like the status field. The same info is stored in multiple places so it's so easy for it to go out of sync like this.

@OnJob({ name: JobName.LIBRARY_SYNC_ASSETS, queue: QueueName.LIBRARY })
async handleSyncAssets(job: JobOf<JobName.LIBRARY_SYNC_ASSETS>): Promise<JobStatus> {
  for (const id of job.ids) {
    const asset = await this.assetRepository.getById(id);
Contributor

Do getByIds in one call

@etnoy etnoy force-pushed the feat/inline-offline-check branch from a3be620 to ef4db4e Compare December 10, 2024 23:35