[Option] Parallelize preconditioners across ranks #94

luciaquirke · 2025-12-14T22:03:46Z

A more VRAM-efficient variant in which preconditioners can be spread across an arbitrary number of nodes to compute large outer products. This is useful because preconditioners are often applied to a query, and the query is then run across a large dataset, so slow but VRAM-efficient preconditioner computation and application is a scalable pattern.

Because the preconditioners don't necessarily fit on a single GPU, we use the Gloo backend to do distributed CPU operations.
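For readers who want the idea in miniature, here is a rough sketch of row-sharding an outer-product sum so that no single rank holds the full matrix. It is not the code in this PR: the helper names (`row_shard`, `shard_outer_product_sum`, `apply_sharded`) are hypothetical, the ranks are simulated in one process with plain Python lists, and a plain matrix-vector product stands in for whatever preconditioner application the real code performs. In an actual run each rank would be a separate process, for example one started with `torch.distributed.init_process_group(backend="gloo")`.

```python
# Illustrative only: helper names and the sharding scheme are assumptions,
# not taken from this PR. Ranks are simulated sequentially in one process.

def row_shard(dim, rank, world_size):
    """Return the [start, stop) row range owned by `rank` (near-even split)."""
    base, extra = divmod(dim, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def shard_outer_product_sum(grads, rank, world_size):
    """Accumulate only this rank's rows of sum_i g_i g_i^T."""
    dim = len(grads[0])
    start, stop = row_shard(dim, rank, world_size)
    block = [[0.0] * dim for _ in range(stop - start)]
    for g in grads:
        for r in range(start, stop):
            row = block[r - start]
            for c in range(dim):
                row[c] += g[r] * g[c]
    return block


def apply_sharded(blocks, query):
    """Multiply the row-sharded matrix by `query`.

    Each block produces its own slice of the output; concatenating the
    slices in rank order gives the full result.
    """
    out = []
    for block in blocks:
        out.extend(sum(a * q for a, q in zip(row, query)) for row in block)
    return out


grads = [[1.0, 2.0, 3.0], [0.0, 1.0, -1.0]]
world_size = 2

# Each simulated rank builds only its own row block.
blocks = [shard_outer_product_sum(grads, r, world_size) for r in range(world_size)]

# Reassembled here only to check the result; a real run would keep blocks separate.
full = [row for block in blocks for row in block]
result = apply_sharded(blocks, [1.0, 0.0, 0.0])
```

With row sharding, per-rank memory for the matrix drops from `dim * dim` to roughly `dim * dim / world_size` entries, at the cost of every rank needing the full gradient vectors and a gather step when the result is applied.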

save

133ec0e

luciaquirke force-pushed the trackstar-run branch 3 times, most recently from ea0996a to 369dc0d · December 15, 2025 00:08

Enable FSDP across nodes with START_RANK

063b198

luciaquirke force-pushed the trackstar-run branch from 369dc0d to 063b198 · December 15, 2025 05:50

luciaquirke added 4 commits December 15, 2025 22:27

Remove final dist barrier

51e1b08

fix tests

88a11a0

add

ef4c58c

Comment out unused code

b222a33

luciaquirke closed this Dec 19, 2025
