Add MLXFoundationModels: an MLX-backed FoundationModels LanguageModel#334
Add MLXFoundationModels: an MLX-backed FoundationModels LanguageModel#334ctymoszek wants to merge 5 commits into
Conversation
MLXLanguageModel conforms to FoundationModels.LanguageModel, so locally-run MLX models can be used through the FoundationModels framework. It supports chat, tool calling, and guided generation (JSON-schema-constrained sampling via a vendored copy of xgrammar). The adapter is gated behind two default-on SwiftPM traits (FoundationModelsIntegration, GuidedGenerationSupport) and is @available(iOS 27.0, macOS 27.0, visionOS 27.0). Platform floors are unchanged from upstream (.macOS(.v14), .iOS(.v17), .tvOS(.v17), .visionOS(.v1)), so existing consumers and anyone who disables the traits are unaffected.
There was a problem hiding this comment.
It would be great to group these into one or more directories for organization purposes.
There was a problem hiding this comment.
Not really for this file, but anchoring it here in the GuidedGeneration directory. I think it would be nice if we had GuidedGeneration and FoundationModels as two separate libraries. FM could depend on GuidedGeneration.
I think most of the GuidedGeneration code doesn't need to be in MLXLMCommon -- e.g. the Embedders do not need it. That isn't to say it can't go there, but it may make more sense in GuidedGeneration if that is its own library.
A split like this would allow people to opt-in to GuidedGeneration without using any of the FM code. For example on Linux the FM code won't be available at all but the GuidedGeneration piece would be interesting.
| mask.needsApply | ||
| ? UnsafeRawPointer(buffer.baseAddress!).assumingMemoryBound(to: UInt32.self) | ||
| : nil | ||
| return applyMaskAndSample( |
There was a problem hiding this comment.
This reads from logits (via item()) before the asyncEvals below are executed. This may cause a stall where we wait for the current graph to finish before scheduling any new work. Consider the normal eval loop:
// save current value -- this will be returned
let previousY = y
// compute the next state and async eval the next token
let token = step(previous: previousY)
y = .init(tokens: token)
asyncEval(token)
tokenCount += 1
return previousY.tokens.item(Int.self)tokens is still an MLXArray (lazy) at the point where asyncEval is called.
| // Wait for GPU to finish (may already be done) | ||
| eval(logits) | ||
| } | ||
| } |
There was a problem hiding this comment.
The normal eval loop has this:
// Apply dynamic cache quantization after each step
maybeQuantizeKVCache(
cache: &cache,
kvBits: kvBits,
kvGroupSize: kvGroupSize,
quantizedKVStart: quantizedKVStart
)Do we need it here inside the generate loop? Not every model uses it, but I think it is important for memory use in models that do.
| let maskArray = bitmaskToMLXArray( | ||
| maskPtr, maskBitCount: vocabSize, totalCount: logitDim) |
There was a problem hiding this comment.
I wonder if this can be cached in the eval loop and passed in?
| #if FoundationModelsIntegration | ||
| #if canImport(FoundationModels, _version: 2) | ||
|
|
||
| import Foundation |
There was a problem hiding this comment.
This indentation isn't my favorite. What do you think about doing this in .swift-format?
diff --git a/.swift-format b/.swift-format
index 8892e9f..b9a726a 100644
--- a/.swift-format
+++ b/.swift-format
@@ -4,4 +4,5 @@
"spaces": 4
},
"spacesAroundRangeFormationOperators": true,
+ "indentConditionalCompilationBlocks": false,
}There was a problem hiding this comment.
I'm in favor, that would look nicer - but maybe that change should be separate from this PR since it affects one other file (WiredMemoryPolicies.swift)
| let (whitespaceBias, whitespaceTokenIDs) = | ||
| WhitespaceTokenBias.compute( | ||
| tokenizer: context.tokenizer | ||
| ) |
There was a problem hiding this comment.
This is computed per response -- should it be held in the ModelCache?
I think WhitespaceTokenBias and also friends.
| /// Prevents race conditions when multiple concurrent requests try to load the model. | ||
| /// Supports caching multiple models by their identifiers. | ||
| private actor ModelCache { | ||
| private var containers: [String: ModelContainer] = [:] |
There was a problem hiding this comment.
A few of these can grow without bound, e.g. in a long running model server. evictAll() / evictAllModels() can remove them but that removes everything. I wonder if this needs finer grained eviction support?
| /// Evicts all cached models, tokenizers, and constraint templates. | ||
| /// Frees GPU memory held by model weights. Subsequent requests will | ||
| /// reload models from disk cache. | ||
| static func evictAllModels() async { |
There was a problem hiding this comment.
Should this be public?
I guess similar question for each of these static functions -- should they be public?
| /// | ||
| /// Thread safety: marked `@unchecked Sendable` because all access is serialized | ||
| /// through `ModelContainer.perform`. | ||
| public struct CompositeLogitProcessor: LogitProcessor, @unchecked Sendable { |
There was a problem hiding this comment.
Should this be @unchecked Sendable? LogitProcessor is not Sendable and the processors can be stateful -- I don't think this is correct.
…ersion: 2) The test target referenced adapter-only symbols (MLXLanguageModel, TranscriptConverter, SchemaConverter, DevelopmentCustomizer, LoadedModelContext, FinalAnswerTool, ...) behind only the FoundationModelsIntegration trait -- or, in a few files, behind no gate at all -- while the source defines those symbols behind both the trait and canImport(FoundationModels, _version: 2). On SDKs predating the v2 FoundationModels (e.g. the macOS 26 SDK) the adapter is correctly compiled out, but the tests still referenced it, failing the build. Mirror the source's gate in the tests so the test surface tracks the adapter surface exactly. Also corrects two files gated on the wrong axis: StopTokenRegressionTests and ToolCallingSchemaTests were under GuidedGenerationSupport, but their DevelopmentCustomizer / SchemaConverter dependencies require the FM + SDK gate.
The two probe bodies that touch the v2-only adapter surface (LanguageModelCapabilities, any LanguageModel, MLXLanguageModel) were behind a bare `canImport(FoundationModels)`, which is true on the 26 SDK (FM 1.x) where those symbols don't yet exist — it would only compile because the target is built against the 27 SDK. Gate them on `canImport(FoundationModels, _version: 2)` so the condition matches the v2 boundary the symbols actually require. The top-of-file `import FoundationModels` stays bare on purpose: it gates on FM *presence* (26+), which is the correct boundary for the absent-tier probe to still compile where FoundationModels does not exist at all.
AvailabilityTests, MLXLanguageModelTests, and TranscriptConverterTests had no top-level #if before; wrapping their bodies in the `FoundationModelsIntegration && canImport(FoundationModels, _version: 2)` gate left the contents at column 0. swift-format's indentConditionalCompilationBlocks (default true) requires content inside a #if to be indented, so the CI lint step reflows them. Indent (and line-wrap the few lines that then exceed 100 cols) to match the #if-indentation style used throughout the source.
MLXFoundationModels is gated on the FoundationModels v2 SDK (canImport(FoundationModels, _version: 2)); its DocC catalog references adapter symbols that don't exist on the SDK the CI runner builds against, so `generate-documentation --warnings-as-errors` fails. Disable the step (if: false) to unblock CI. Re-enable before merge once doc generation builds against the FoundationModels SDK, or verify-docs.sh skips the FM target when v2 is absent.
MLXLanguageModel conforms to FoundationModels.LanguageModel, so locally-run MLX models can be used through the FoundationModels framework. It supports chat, tool calling, and guided generation (JSON-schema-constrained sampling via a vendored copy of xgrammar).
The adapter is gated behind two default-on SwiftPM traits (FoundationModelsIntegration, GuidedGenerationSupport) and is @available(iOS 27.0, macOS 27.0, visionOS 27.0). Platform floors are unchanged from upstream (.macOS(.v14), .iOS(.v17), .tvOS(.v17), .visionOS(.v1)), so existing consumers and anyone who disables the traits are unaffected.