feat: agentic loop, vision, Gemini/Ollama, structured outputs, prompt caching #3

Open
VDurocher wants to merge 14 commits into master from feature/vision-agentic-multimodel

@VDurocher

Owner

Summary

Adds five new features to the AIAgent protocol and all provider clients.

Feature 1 — Agentic ReAct Loop

  • run(messages:tools:executor:maxSteps:) default impl on AIAgent protocol
  • Automatically executes tool calls in a loop until the model returns a final text response
  • Appends asHistoryMessage (assistant message carrying tool calls) + tool results to history at each step
  • Throws AIError.agentLoopExceeded(steps:) when maxSteps is reached
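
The loop described above can be sketched roughly as follows. This is a simplified, synchronous illustration, not the PR's implementation: `ToolCall`, `ModelReply`, `Message`, `LoopError`, and `runLoop` are hypothetical stand-ins for the real `AIAgent` types, and the actual `run(messages:tools:executor:maxSteps:)` is presumably `async`.

```swift
// Hypothetical stand-ins for the PR's types (AIMessage, AIError, ...).
struct ToolCall {
    let id: String
    let name: String
    let arguments: String
}

enum ModelReply {
    case text(String)            // final answer, ends the loop
    case toolCalls([ToolCall])   // the model wants tools executed first
}

enum Message {
    case user(String)
    case assistantToolCalls([ToolCall])
    case toolResult(callID: String, toolName: String, content: String)
}

enum LoopError: Error, Equatable {
    case agentLoopExceeded(steps: Int)
    case unknownTool(String)
}

/// Calls `model` until it returns text, executing requested tools in between.
func runLoop(
    messages: [Message],
    toolNames: Set<String>,
    maxSteps: Int,
    model: ([Message]) throws -> ModelReply,
    executor: (ToolCall) throws -> String
) throws -> String {
    var history = messages
    for _ in 0..<maxSteps {
        switch try model(history) {
        case .text(let answer):
            return answer
        case .toolCalls(let calls):
            // Record the assistant turn that carried the tool calls...
            history.append(.assistantToolCalls(calls))
            for call in calls {
                // Reject tools that were never declared to the model.
                guard toolNames.contains(call.name) else {
                    throw LoopError.unknownTool(call.name)
                }
                // ...then record each tool result before asking the model again.
                let output = try executor(call)
                history.append(.toolResult(callID: call.id, toolName: call.name, content: output))
            }
        }
    }
    throw LoopError.agentLoopExceeded(steps: maxSteps)
}
```

Bounding the loop with `maxSteps` means a model that keeps requesting tools fails with an explicit error instead of looping indefinitely.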

Feature 2 — Vision / Multi-Modal

  • New AIImageContent enum: .url(URL) and .data(Data, mimeType: String)
  • AIMessage gains an images: [AIImageContent]? field; user(_:images:) factory updated
  • OpenAI: images encoded as data:<mime>;base64,... URLs in a ContentPart array
  • Anthropic: images encoded as source: {type:"url"/"base64", ...} blocks
  • Gemini: images encoded as inline_data: {mime_type, data} parts
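
As an illustration of the OpenAI encoding above, here is a minimal sketch of turning an image into the string an `image_url`-style field would carry. `ImageContent` and `openAIURLString` are hypothetical names mirroring `AIImageContent`; the PR's actual `ContentPart` encoding may differ.

```swift
import Foundation

// Hypothetical mirror of the PR's AIImageContent enum.
enum ImageContent {
    case url(URL)
    case data(Data, mimeType: String)

    /// Remote images pass through as-is; inline bytes become a base64 data URL.
    var openAIURLString: String {
        switch self {
        case .url(let url):
            return url.absoluteString
        case .data(let bytes, let mimeType):
            return "data:\(mimeType);base64,\(bytes.base64EncodedString())"
        }
    }
}
```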

Feature 3 — Google Gemini + Ollama

  • New GeminiClient actor: generateContent, streamGenerateContent (SSE with alt=sse), function calling
  • API key passed as ?key= query parameter (not a header)
  • GeminiScalar Codable enum for outbound function call args; GeminiArgValue Decodable for inbound
  • Ollama reuses OpenAIClient with http://localhost:11434/v1 — zero extra code
  • AIConfiguration.validate() skips API key check for Ollama
  • 12 new convenience initializers (Gemini + Ollama + Claude 3.7/4.x + GPT-4.1)
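
For reference, a request URL carrying the key as a query parameter, with `alt=sse` added for streaming, could be built along these lines. The `geminiURL` helper and the `/v1beta` path segment are assumptions made for illustration; the PR's `GeminiClient` may construct its URLs differently.

```swift
import Foundation

// Hypothetical helper; the API version in the path is an assumption.
func geminiURL(model: String, apiKey: String, stream: Bool) -> URL? {
    let method = stream ? "streamGenerateContent" : "generateContent"
    var components = URLComponents()
    components.scheme = "https"
    components.host = "generativelanguage.googleapis.com"
    components.path = "/v1beta/models/\(model):\(method)"
    // The key travels as ?key=..., not as a header.
    var query = [URLQueryItem(name: "key", value: apiKey)]
    if stream {
        // alt=sse asks for server-sent events rather than a JSON array.
        query.append(URLQueryItem(name: "alt", value: "sse"))
    }
    components.queryItems = query
    return components.url
}
```

Because the key ends up in the URL, it is worth keeping full request URLs out of logs and error messages.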

Feature 4 — Structured Outputs

  • send<T: Decodable & Sendable>(messages:as:) generic method on AIAgent protocol
  • sendForJSON() protocol hook overridden per provider for native JSON mode
    • OpenAI: response_format: {type: "json_object"}
    • Gemini: responseMimeType: "application/json" in generationConfig
    • Anthropic: system prompt injection (no native JSON mode)
  • Strips ```json ... ``` markdown fences before decoding
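
The fence-stripping step matters mostly for providers without a native JSON mode, where the model may wrap its answer in a markdown code block. A minimal sketch, assuming the hypothetical helpers `stripJSONFences` and `decodeStructured` (not the PR's actual function names):

```swift
import Foundation

/// Removes a leading fence line (with or without a language tag) and a trailing fence.
func stripJSONFences(_ text: String) -> String {
    let fence = String(repeating: "`", count: 3)
    var body = text.trimmingCharacters(in: .whitespacesAndNewlines)
    guard body.hasPrefix(fence) else { return body }
    if let newline = body.firstIndex(of: "\n") {
        // Drop the whole opening line, including any language tag such as "json".
        body = String(body[body.index(after: newline)...])
    } else {
        body = String(body.dropFirst(fence.count))
    }
    if body.hasSuffix(fence) {
        body = String(body.dropLast(fence.count))
    }
    return body.trimmingCharacters(in: .whitespacesAndNewlines)
}

/// Decodes a model reply into `T` after stripping any markdown fence.
func decodeStructured<T: Decodable>(_ type: T.Type, from text: String) throws -> T {
    try JSONDecoder().decode(type, from: Data(stripJSONFences(text).utf8))
}
```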

Feature 5 — Anthropic Prompt Caching

  • AIMessage gains cacheControl: Bool; system(_:cached:) factory updated
  • AnthropicClient detects cached messages and adds anthropic-beta: prompt-caching-2024-07-31 header
  • SystemContent supports both string format (no cache) and blocks array (with cache_control)
  • ContentBlock encodes cache_control: {type: "ephemeral"} on text and image blocks when flagged
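
The two system-prompt shapes can be modelled with a single-value `Encodable` enum, roughly as below. The type names here (`CacheControl`, `TextBlock`, `SystemPrompt`, `SystemEnvelope`) are hypothetical simplifications of the PR's `SystemContent` and `ContentBlock`; only the JSON field names come from the description above.

```swift
import Foundation

// Hypothetical simplified types; only the JSON field names follow the PR text.
struct CacheControl: Encodable {
    let type = "ephemeral"
}

struct TextBlock: Encodable {
    let type = "text"
    let text: String
    let cacheControl: CacheControl?   // omitted from the JSON when nil

    enum CodingKeys: String, CodingKey {
        case type, text
        case cacheControl = "cache_control"
    }
}

enum SystemPrompt: Encodable {
    case plain(String)        // string format, no caching
    case blocks([TextBlock])  // blocks array, can carry cache_control

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        switch self {
        case .plain(let text): try container.encode(text)
        case .blocks(let blocks): try container.encode(blocks)
        }
    }
}

/// Minimal request wrapper so the prompt is encoded under a "system" key.
struct SystemEnvelope: Encodable {
    let system: SystemPrompt
}

func systemPrompt(_ text: String, cached: Bool) -> SystemPrompt {
    cached
        ? .blocks([TextBlock(text: text, cacheControl: CacheControl())])
        : .plain(text)
}
```

Keeping the plain-string form for uncached prompts leaves existing request bodies unchanged for callers who never opt into caching.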

Bug Fixes

  • Bool must precede Int/Double in Any-based switch statements to avoid NSNumber ambiguity when decoding JSON booleans via JSONSerialization — fixed in both AnthropicScalar.from and GeminiScalar.from

Test plan

  • swift build passes with zero errors and zero warnings under Swift 6.0 strict concurrency
  • OpenAI: send a message, stream a message, send with tools, send with images (GPT-4o)
  • Anthropic: send a message, stream, tool use, prompt caching (Claude Sonnet 4.6)
  • Gemini: send a message, stream, tool use, JSON mode (Gemini 2.0 Flash)
  • Ollama: ollamaLlama32() connects to local server, send a message
  • Agentic loop: verify multi-step tool call resolves correctly within maxSteps
  • Structured outputs: send(messages:as:) decodes a known JSON fixture
  • Prompt caching: confirm anthropic-beta header is present when cacheControl: true
  • Vision: attach JPEG data to a GPT-4o and Claude message, verify non-nil response

…ompt caching

Feature 1 — Agentic ReAct loop
- run(messages:tools:executor:maxSteps:) default impl in AIAgent protocol
- Auto-executes tool calls until model returns a final text response
- Throws agentLoopExceeded when maxSteps is reached

Feature 2 — Vision multi-modal
- New AIImageContent enum (.url / .data) with Sendable+Codable conformances
- AIMessage gains images field; user(_:images:) factory updated
- OpenAI: ContentPart encodes images as data URLs (base64)
- Anthropic: ContentBlock.image encodes URL and base64 sources
- Gemini: GeminiPart.inlineData encodes base64 bytes

Feature 3 — Google Gemini + Ollama providers
- New GeminiClient actor with generateContent / streamGenerateContent
- API key passed as ?key= query param; SSE streaming with alt=sse
- GeminiScalar Codable enum for outbound function call args encoding
- GeminiArgValue Decodable enum for inbound function call args decoding
- Ollama reuses OpenAIClient with http://localhost:11434/v1 base URL
- AIConfiguration.validate() skips API key check for Ollama provider

Feature 4 — Structured outputs
- send<T: Decodable>(messages:as:) generic method on AIAgent protocol
- sendForJSON() protocol hook overridden per provider for native JSON mode
- OpenAI: response_format {type:"json_object"}
- Gemini: responseMimeType "application/json" in generationConfig
- Anthropic: falls back to system prompt injection (no native JSON mode)
- Strips ```json ... ``` markdown fences before decoding

Feature 5 — Anthropic prompt caching
- AIMessage gains cacheControl: Bool field; system(_:cached:) factory updated
- AIMessageWithTools.asHistoryMessage carries tool calls for history encoding
- AnthropicClient detects cacheControl messages and adds anthropic-beta header
- SystemContent supports both string and blocks format for cached system messages
- ContentBlock encodes cache_control: {type:"ephemeral"} when cached

Bug fix: Bool must precede Int/Double in Any-based switch to avoid NSNumber
ambiguity when decoding JSON booleans via JSONSerialization (both clients).

- Add Gemini, Ollama, vision, agentic loop, structured outputs, prompt caching
- Full convenience initializer list for all 20+ supported models
- New sections: Agentic Loop, Vision, Gemini, Ollama, Structured Outputs, Prompt Caching
- Updated architecture diagram and API reference
- Updated SwiftAIAgentCore.swift umbrella doc comments

…cancellation

SEC-02: Validate MIME type in AIImageContent against allowlist
(image/jpeg, image/png, image/gif, image/webp) — throws AIError.invalidContext
on unknown types

SEC-03: Enforce 20 MB max size on inline image data — prevents OOM on large inputs

SEC-05: Cap SSE buffer at 1 MB per message in NetworkClient.stream()
— throws AIError.streamingError on overflow, preventing unbounded memory growth
from malformed or malicious SSE streams
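
A capped SSE buffer of the kind SEC-05 describes might look like the sketch below. `SSEBuffer` and `StreamError` are hypothetical names, and measuring the cap in UTF-8 bytes of the unterminated remainder is an assumption; the PR's `NetworkClient.stream()` may account for size differently.

```swift
import Foundation

// Hypothetical error; the PR throws AIError.streamingError instead.
enum StreamError: Error, Equatable {
    case bufferOverflow(limit: Int)
}

struct SSEBuffer {
    let limit: Int
    private var pending = ""

    init(limit: Int = 1_048_576) {   // 1 MB default, as in SEC-05
        self.limit = limit
    }

    /// Appends a chunk and returns every complete event (terminated by a blank line).
    mutating func append(_ chunk: String) throws -> [String] {
        pending += chunk
        var events: [String] = []
        while let separator = pending.range(of: "\n\n") {
            events.append(String(pending[..<separator.lowerBound]))
            pending.removeSubrange(..<separator.upperBound)
        }
        // Whatever remains is an unterminated event; refuse to let it grow unbounded.
        if pending.utf8.count > limit {
            pending = ""
            throw StreamError.bufferOverflow(limit: limit)
        }
        return events
    }
}
```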

SEC-06: Truncate HTTP error body to 500 chars in NetworkClient.execute()
— prevents leaking large or sensitive provider error responses

SEC-08: Validate tool names against the declared tool list in the agentic loop
— throws AIError.invalidContext if the model requests an unknown tool

SEC-09: Add onTermination handler to streamCompletion() in OpenAIClient,
AnthropicClient, and GeminiClient — cancels the inner Task when the consumer
stops iterating, preventing network task leaks
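
The cancellation pattern in SEC-09 is the standard `onTermination` hook on the stream's continuation. A minimal sketch, where `cancellableStream` is a hypothetical producer that yields placeholder chunks instead of reading from the network:

```swift
// Hypothetical producer standing in for a provider's streamCompletion().
func cancellableStream(chunks: Int) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        let producer = Task {
            do {
                for index in 0..<chunks {
                    // Stop promptly once the consumer has gone away.
                    try Task.checkCancellation()
                    continuation.yield("chunk \(index)")
                    try await Task.sleep(nanoseconds: 1_000_000)
                }
                continuation.finish()
            } catch {
                continuation.finish(throwing: error)
            }
        }
        // Runs when the stream finishes or the consumer stops iterating.
        continuation.onTermination = { _ in
            producer.cancel()
        }
    }
}
```

Without the handler, the inner `Task` would keep running, and in a real client keep its network request open, after the consumer breaks out of the `for try await` loop.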

SEC-10: Validate URL scheme in AIImageContent.validated() — only http/https
accepted, prevents file:// or internal URLs being forwarded to provider APIs
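
Taken together, the image checks in SEC-02, SEC-03, and SEC-10 amount to something like the sketch below. `ImageInput` and `ImageValidationError` are hypothetical names; the PR throws `AIError.invalidContext`, and whether its 20 MB limit means 20 × 1024 × 1024 bytes is an assumption.

```swift
import Foundation

// Hypothetical error type; the PR uses AIError.invalidContext.
enum ImageValidationError: Error, Equatable {
    case unsupportedMIMEType(String)
    case tooLarge(bytes: Int)
    case unsupportedScheme(String?)
}

// Hypothetical mirror of AIImageContent with its validated() check.
enum ImageInput {
    case url(URL)
    case data(Data, mimeType: String)

    static let allowedMIMETypes: Set<String> = ["image/jpeg", "image/png", "image/gif", "image/webp"]
    static let maxInlineBytes = 20 * 1024 * 1024   // assumed meaning of "20 MB"

    func validated() throws -> ImageInput {
        switch self {
        case .url(let url):
            // Only http/https, so file:// and other schemes are never forwarded.
            let scheme = url.scheme?.lowercased()
            guard scheme == "http" || scheme == "https" else {
                throw ImageValidationError.unsupportedScheme(scheme)
            }
        case .data(let bytes, let mimeType):
            guard Self.allowedMIMETypes.contains(mimeType.lowercased()) else {
                throw ImageValidationError.unsupportedMIMEType(mimeType)
            }
            guard bytes.count <= Self.maxInlineBytes else {
                throw ImageValidationError.tooLarge(bytes: bytes.count)
            }
        }
        return self
    }
}
```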

GeminiPart CodingKeys: use camelCase (inlineData, functionCall,
functionResponse) — Gemini REST API v1 uses camelCase throughout.
Snake_case was silently dropping vision and tool call parts.

GenerateContentResponse.Part and Candidate: remove redundant explicit
CodingKeys that mapped camelCase to camelCase — Swift default synthesis
already handles this correctly.

FunctionResponse.name: use metadata tool_name (function name) instead of
tool_call_id — Gemini function_response.name must be the function name,
not the opaque call identifier.

AIAgentProtocol agentic loop: add tool_name to tool result metadata
alongside tool_call_id so Gemini can correctly identify function responses.

AnthropicScalar decoder: move Bool check before Int to avoid JSON boolean
ambiguity in the Codable path.

AnthropicInput.toJSONString: replace JSONSerialization with JSONEncoder on
the typed AnthropicScalar dict — eliminates NSNumber ambiguity and removes
the only remaining use of [String: Any] in the encode path.
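
A typed scalar enum in the spirit of these last two fixes might look like the sketch below. `Scalar` is a hypothetical stand-in for `AnthropicScalar`, whose real cases are not shown in this PR text; the point is that `Bool` is tried before `Int`, and that encoding goes through `JSONEncoder` rather than `[String: Any]`.

```swift
import Foundation

// Hypothetical stand-in for AnthropicScalar / GeminiScalar.
enum Scalar: Codable, Equatable {
    case bool(Bool)
    case int(Int)
    case double(Double)
    case string(String)
    case null

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        if container.decodeNil() {
            self = .null
        } else if let value = try? container.decode(Bool.self) {
            // Bool is tried first so true/false never reach the numeric cases.
            self = .bool(value)
        } else if let value = try? container.decode(Int.self) {
            self = .int(value)
        } else if let value = try? container.decode(Double.self) {
            self = .double(value)
        } else {
            self = .string(try container.decode(String.self))
        }
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        switch self {
        case .bool(let value): try container.encode(value)
        case .int(let value): try container.encode(value)
        case .double(let value): try container.encode(value)
        case .string(let value): try container.encode(value)
        case .null: try container.encodeNil()
        }
    }
}
```

Encoding a `[String: Scalar]` dictionary this way keeps each value's type explicit, so a boolean is written as `true` or `false` rather than as a number.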