LoRA: runtime toggle and PEFT adapter loader #316
if let bias { return y + bias }
return y
I wonder if this should call super? Linear is simple enough and unlikely to change, but since it is a subtype it might be better to call that way. It looks like LoRALinear does it that way -- I think that is the pattern to follow
Yes, you're right, it can. Updated accordingly.
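For readers following the thread, the pattern under discussion can be sketched in plain Swift. This is only an illustration: `SketchLinear` and `SketchToggleLinear` are hypothetical scalar stand-ins, not the MLX `Linear` type or the classes changed in this PR.

```swift
// Hypothetical scalar stand-in for a linear layer (not the MLX `Linear`).
class SketchLinear {
    let weight: Double
    let bias: Double?

    init(weight: Double, bias: Double? = nil) {
        self.weight = weight
        self.bias = bias
    }

    func callAsFunction(_ x: Double) -> Double {
        let y = x * weight
        if let bias { return y + bias }
        return y
    }
}

// Hypothetical subclass that adds a LoRA-style term on top of the base output.
final class SketchToggleLinear: SketchLinear {
    var loraEnabled = true
    let loraDelta: Double  // scalar stand-in for the low-rank update

    init(weight: Double, bias: Double? = nil, loraDelta: Double) {
        self.loraDelta = loraDelta
        super.init(weight: weight, bias: bias)
    }

    override func callAsFunction(_ x: Double) -> Double {
        // Delegate the base computation to the superclass instead of
        // re-implementing the weight/bias logic here.
        let y = super.callAsFunction(x)
        guard loraEnabled else { return y }
        return y + x * loraDelta
    }
}
```

Calling `super` keeps the subclass correct if the base computation ever changes, which is the reviewer's point above.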
// Conversion:
// - Strip the leading `base_model.model.` prefix.
// - Rename `.lora_A.weight` -> `.lora_a`, `.lora_B.weight` -> `.lora_b`.
// - Transpose both tensors to match MLX's [in, r] / [r, out] convention.
I wonder if this block comment should be on fromPEFT? As it is you can only see it in the code, not in the built docs.
Good catch. It's moved to fromPEFT.
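The conversion steps listed in that comment can be sketched roughly as follows. This is not the PR's implementation: `convertPEFTKey` and `transposed` are hypothetical helpers, and nested `[[Float]]` arrays stand in for MLX tensors. The sketch assumes the PyTorch layout PEFT uses, where `lora_A.weight` has shape [r, in] and `lora_B.weight` has shape [out, r].

```swift
/// Hypothetical helper: maps a PEFT tensor key to the MLX-style key
/// described above. Returns nil for keys that are not LoRA weights.
func convertPEFTKey(_ key: String) -> String? {
    let prefix = "base_model.model."
    var name = key
    if name.hasPrefix(prefix) {
        name.removeFirst(prefix.count)
    }
    let renames = [(".lora_A.weight", ".lora_a"), (".lora_B.weight", ".lora_b")]
    for (suffix, replacement) in renames where name.hasSuffix(suffix) {
        return String(name.dropLast(suffix.count)) + replacement
    }
    return nil
}

/// Hypothetical helper: transposes a row-major matrix, e.g. [r, in] -> [in, r].
func transposed(_ matrix: [[Float]]) -> [[Float]] {
    guard let firstRow = matrix.first else { return [] }
    return firstRow.indices.map { column in matrix.map { $0[column] } }
}
```

For example, `base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight` would become `model.layers.0.self_attn.q_proj.lora_a`.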
// Match "<encoder|model>.layers.<n>." then return the rest. This
// matches the two common backbone layouts the project uses.
let parts = path.split(separator: ".", omittingEmptySubsequences: false)
for i in 0 ..< (parts.count - 2) {
if parts.count is 0 or 1 this will trap -- might want to guard vs that and return nil.
Good catch. Yes, a guard is needed there. It's added.
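A minimal sketch of the guarded version, for illustration only. The function name `layerRelativePath` and the exact matching rules are assumptions, not the code merged in this PR. The guard matters because `0 ..< (parts.count - 2)` traps at runtime when the upper bound is negative.

```swift
/// Hypothetical helper: returns the part of `path` that follows
/// "<encoder|model>.layers.<n>.", or nil if no such segment exists.
func layerRelativePath(_ path: String) -> String? {
    let parts = path.split(separator: ".", omittingEmptySubsequences: false)
    // With fewer than three components the range below would have a
    // negative upper bound and trap, so return nil first.
    guard parts.count >= 3 else { return nil }
    for i in 0 ..< (parts.count - 2) {
        let isBackbone = parts[i] == "encoder" || parts[i] == "model"
        if isBackbone, parts[i + 1] == "layers", Int(parts[i + 2]) != nil {
            return parts[(i + 3)...].joined(separator: ".")
        }
    }
    return nil
}
```

With this guard, a short input such as `q_proj` returns nil instead of crashing.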
davidkoski left a comment
Looks good, thank you!
Proposed changes
This PR adds two related, backward-compatible improvements to the LoRA infrastructure in MLXLMCommon.
Runtime loraEnabled toggle on LoRALayer
A new loraEnabled: Bool property on the LoRALayer protocol lets callers enable or disable the LoRA term at runtime without unloading the adapter. When false, the layer behaves as the underlying base layer (no LoRA term added).
This is needed for inference patterns that interleave LoRA-on and LoRA-off forward passes against the same model — for example, speculative decoding schemes where a LoRA-tuned drafter feeds an un-tuned verifier with a shared KV cache. Today the only way to "disable" a loaded adapter is to unload and reload it, which is too expensive to do per inference step.
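As a rough illustration of that interleaving, a caller might wrap the LoRA-off pass in a scoped helper that restores the previous state afterwards. Everything here is hypothetical: `SketchAdapterLayer` and `withLoRADisabled` are not part of this PR or of MLXLMCommon, and the scalar `forward` only stands in for a real forward pass.

```swift
// Hypothetical layer with a runtime toggle; the LoRA term is a constant 0.5.
final class SketchAdapterLayer {
    var loraEnabled = true

    func forward(_ x: Double) -> Double {
        loraEnabled ? x + 0.5 : x
    }
}

/// Hypothetical helper: runs `body` with LoRA disabled on `layers`,
/// then restores each layer's previous setting, even if `body` throws.
func withLoRADisabled<T>(
    _ layers: [SketchAdapterLayer],
    _ body: () throws -> T
) rethrows -> T {
    let previous = layers.map(\.loraEnabled)
    layers.forEach { $0.loraEnabled = false }
    defer {
        for (layer, wasEnabled) in zip(layers, previous) {
            layer.loraEnabled = wasEnabled
        }
    }
    return try body()
}

let sharedLayer = SketchAdapterLayer()
let drafted = sharedLayer.forward(1.0)  // drafter pass, LoRA on
let verified = withLoRADisabled([sharedLayer]) {
    sharedLayer.forward(1.0)            // verifier pass, LoRA off
}
```

The adapter weights stay loaded throughout; only the flag changes between passes.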
Backward compatibility: Strictly additive. The default value is true (LoRA always applied, matching pre-PR behavior). External LoRALayer conformers compile unchanged because the protocol-extension default satisfies the new requirement; their toggle is silently a no-op until they opt in by adding their own stored property.
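The compatibility behavior described above can be sketched with a stand-in protocol. This is an assumption about the shape of the default, not the PR's source: `SketchLoRALayer`, `LegacyLayer`, and `OptedInLayer` are hypothetical names.

```swift
// Hypothetical stand-in for the protocol requirement.
protocol SketchLoRALayer {
    var loraEnabled: Bool { get set }
}

// Default that satisfies the requirement for existing conformers:
// always reports true, and writes are ignored.
extension SketchLoRALayer {
    var loraEnabled: Bool {
        get { true }
        nonmutating set { /* ignored until the conformer opts in */ }
    }
}

// An existing conformer compiles unchanged; its toggle is a no-op.
final class LegacyLayer: SketchLoRALayer {}

// A conformer opts in by declaring its own stored property.
final class OptedInLayer: SketchLoRALayer {
    var loraEnabled = true
}

let legacy = LegacyLayer()
legacy.loraEnabled = false   // silently ignored by the default

let optedIn = OptedInLayer()
optedIn.loraEnabled = false  // actually stored
```

The trade-off is that a conformer relying on the default gets no compile-time signal that its toggle does nothing.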
LoRAContainer.fromPEFT(directory:) — load HuggingFace PEFT adapters
A new static loader on LoRAContainer reads adapter directories in the standard HuggingFace peft format:
These changes are extracted from PR #310 where the model's speculative-decoding mode requires per-phase LoRA toggling and the canonical adapter ships in PEFT format.
Checklist
Put an x in the boxes that apply.
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes