Skip to content

Add office file exporters (DOCX/XLSX/PPTX/RTF) for Markdown output#34

Open
ronaldmannak wants to merge 5 commits into
mainfrom
claude/markdown-to-office-export-bexpyx
Open

Add office file exporters (DOCX/XLSX/PPTX/RTF) for Markdown output#34
ronaldmannak wants to merge 5 commits into
mainfrom
claude/markdown-to-office-export-bexpyx

Conversation

@ronaldmannak

Copy link
Copy Markdown
Contributor

Implement the write-side of PicoDocs: convert Markdown and ConverterResult into binary office formats, completing the round-trip capability.

Summary

This PR adds a complete export pipeline that mirrors the existing read-side converters. Users can now serialize LLM Markdown output or structured document results into DOCX, XLSX, PPTX, and RTF files via PicoDocsEngine.write(...). The implementation includes hand-rolled OOXML writers for DOCX/XLSX/PPTX (ensuring round-trip fidelity with the existing readers), platform-native exporters for RTF, and shared Markdown intermediate representations used by both renderers and exporters.

Key Changes

  • Shared Markdown IR: Extracted block-level (MarkdownBlock) and inline (MarkdownInline) parsing from DocumentRenderer into reusable enums (MarkdownBlockParser, MarkdownInlineParser). Both renderers and exporters now consume the same structured representation, ensuring consistency and simplifying future parser improvements.

  • Hand-rolled OOXML exporters:

    • WordprocessingMLExporter: Writes minimal-but-valid DOCX files with headings (via w:pStyle), emphasis (w:b/w:i), lists (w:numPr), tables, links (w:hyperlink), and embedded images. Deliberately mirrors WordConverter's markers for round-trip validation.
    • XLSXExporter: Generates SpreadsheetML with one sheet per document section. Uses lossless metadata["csv"] when present; otherwise flattens Markdown to rows (matching DocumentRenderer.renderCSV).
    • PPTXExporter: Creates PresentationML with one slide per top-level heading or explicit .slide section. Includes fixed template parts (slide master, layout, theme) via PPTXTemplates.
  • Platform-native exporters:

    • AttributedStringRTFExporter: Builds NSAttributedString from the Markdown IR and serializes to RTF (available across all Apple platforms).
    • AttributedStringDOCXExporter: Optional macOS-only DOCX writer using NSAttributedString.officeOpenXML; intentionally not registered by default (hand-rolled version is the primary).
    • AttributedStringDocumentBuilder: Shared utility that constructs attributed strings from ConverterResult, supporting headings, emphasis, lists, code, blockquotes, and tables.
  • Infrastructure:

    • DocumentExporter protocol: Sendable, synchronous serializers (pure CPU, no I/O).
    • DocumentExporterRegistry: Immutable, priority-ordered registry of exporters (mirrors DocumentConverterRegistry).
    • OOXMLPackageWriter: In-memory ZIP builder for OOXML files (no temp files; pure Data output).
    • ExportableFileType: Enum for writable formats (DOCX, XLSX, PPTX, RTF, Pages, Keynote). Pages/Keynote are marked unimplemented with research notes explaining why third-party iWork writing is impractical.
    • OfficeMediaType: Shared MIME type mapping for image extensions (used by both readers and writers).
  • Engine API: Added PicoDocsEngine.write(result:to:registry:) and PicoDocsEngine.write(markdown:to:) for serialization.

  • Tests: ExporterTests validates round-trip (write → re-import with existing converters → compare recovered text) for DOCX and XLSX, and structural validation for PPTX (unzip, required parts, title text).

  • Documentation: Updated README with export examples.

Notable Implementation Details

  • Round-trip oracle: DOCX/XLSX exporters are validated by re-importing with WordConverter/SpreadsheetConverter and comparing recovered Markdown, using the readers as ground truth.
  • Markdown IR extraction: DocumentRenderer's inline helpers remain for

https://claude.ai/code/session_01MRakJBZVBQ4CcTdcypk9Z6

PicoDocs was one-directional (office files -> Markdown). This adds the
"exporter" half of the engine: turn LLM Markdown output or a structured
ConverterResult into DOCX, XLSX, PPTX, or RTF.

Architecture mirrors the import side:
- DocumentExporter protocol + DocumentExporterRegistry (value type,
  priority-ordered, .registering chaining) parallel DocumentConverter/Registry.
- PicoDocsEngine.write(_:to:) / write(markdown:to:) / transcode(...) parallel
  convert/export, returning Data. Empty input throws .emptyDocument (image-only
  results are exempt); unimplemented formats throw .unableToExportToRequestedFormat.
- New ExportableFileType (binary office outputs), kept separate from the text
  ExportFileType.

Shared intermediate representation:
- Extracted the renderer's private Block parser into internal MarkdownBlock /
  MarkdownBlockParser so both halves use one parser (behavior-preserving;
  DocumentRenderer repointed at it).
- Added a structured MarkdownInline IR for exporters that need run structure.

Per-format writers:
- WordprocessingMLExporter: primary, all-platform DOCX writer (hand-rolled OOXML),
  the round-trip oracle; mirrors WordConverter's markers (Heading{N} styles,
  w:numPr lists, w:b/w:i, w:hyperlink, a:blip image drawings).
- XLSXExporter / PPTXExporter: hand-rolled OOXML on a shared OOXMLPackageWriter
  (ZIPFoundation write).
- AttributedStringRTFExporter (Apple) builds NSAttributedString from the IR (no
  WebKit HTML import, so it stays thread-agnostic); AttributedStringDOCXExporter is
  opt-in/macOS-only and NOT registered by default.
- Pages/Keynote are a documented spike only (no writer).
- Lifted WordConverter.mimeType(forExtension:) into shared OfficeMediaType.

Tests: round-trip DOCX/XLSX (via WordConverter/SpreadsheetConverter), RTF (Apple,
via RTFConverter), structural PPTX, inline-IR unit tests, and the empty/unsupported
contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01MRakJBZVBQ4CcTdcypk9Z6

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces binary office document exporting capabilities (DOCX, XLSX, PPTX, and RTF) to PicoDocs, sharing a unified block and inline Markdown intermediate representation between converters and exporters. Feedback on the changes highlights several key improvement opportunities: sanitizing invalid XML 1.0 control characters to prevent corrupted office files, making image lookups robust against path prefixes, preserving multi-scalar grapheme clusters (such as emojis) during sheet name sanitization, and pre-compiling regular expressions in the inline parser to avoid recursive overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +70 to +74
static func escape(_ text: String) -> String {
text.replacingOccurrences(of: "&", with: "&amp;")
.replacingOccurrences(of: "<", with: "&lt;")
.replacingOccurrences(of: ">", with: "&gt;")
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

XML 1.0 documents do not allow certain control characters (e.g., 0x00 to 0x08, 0x0B, 0x0C, 0x0E to 0x1F except tab, newline, and carriage return). If these characters are present in the text (for example, extracted from PDFs or generated by LLMs), writing them directly into the XML will result in corrupted office files that fail to open in Microsoft Office. We should sanitize the input string by filtering out these invalid XML characters.

Suggested change
static func escape(_ text: String) -> String {
text.replacingOccurrences(of: "&", with: "&amp;")
.replacingOccurrences(of: "<", with: "&lt;")
.replacingOccurrences(of: ">", with: "&gt;")
}
static func escape(_ text: String) -> String {
let sanitized = String(text.unicodeScalars.filter { scalar in
let val = scalar.value
return val == 0x9 || val == 0xA || val == 0xD ||
(val >= 0x20 && val <= 0xD7FF) ||
(val >= 0xE000 && val <= 0xFFFD) ||
(val >= 0x10000 && val <= 0x10FFFF)
})
return sanitized.replacingOccurrences(of: "&", with: "&amp;")
.replacingOccurrences(of: "<", with: "&lt;")
.replacingOccurrences(of: ">", with: "&gt;")
}

Comment on lines +178 to +179
private func imageRun(alt: String, source: String) -> String? {
guard let data = images[source] else { return nil }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the image source in the Markdown contains a path (e.g., word/media/pic.png) rather than just the basename, looking up images[source] will fail because the images index is keyed by the last path component (basename). To make the lookup robust, we should fall back to checking the last path component of the source string.

Suggested change
private func imageRun(alt: String, source: String) -> String? {
guard let data = images[source] else { return nil }
private func imageRun(alt: String, source: String) -> String? {
let key = images[source] != nil ? source : (source as NSString).lastPathComponent
guard let data = images[key] else { return nil }

Comment on lines +139 to +144
private static func sanitizeSheetName(_ name: String) -> String {
let invalid = CharacterSet(charactersIn: ":\\/?*[]")
let cleaned = String(name.unicodeScalars.map { invalid.contains($0) ? Character(" ") : Character($0) })
.trimmingCharacters(in: .whitespacesAndNewlines)
return String(cleaned.prefix(31))
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Mapping each UnicodeScalar individually to a Character and then joining them breaks multi-scalar grapheme clusters (such as flag emojis, skin tone modifiers, or family emojis) into separate characters. To preserve grapheme clusters, we should map the scalars to a new UnicodeScalarView first, and then convert that view to a String.

Suggested change
private static func sanitizeSheetName(_ name: String) -> String {
let invalid = CharacterSet(charactersIn: ":\\/?*[]")
let cleaned = String(name.unicodeScalars.map { invalid.contains($0) ? Character(" ") : Character($0) })
.trimmingCharacters(in: .whitespacesAndNewlines)
return String(cleaned.prefix(31))
}
private static func sanitizeSheetName(_ name: String) -> String {
let invalid = CharacterSet(charactersIn: ":\\/?*[]")
let cleanedScalars = name.unicodeScalars.map { invalid.contains($0) ? UnicodeScalar(32) : $0 }
let cleaned = String(String.UnicodeScalarView(cleanedScalars))
.trimmingCharacters(in: .whitespacesAndNewlines)
return String(cleaned.prefix(31))
}

Comment on lines +132 to +142

/// Parses `***`/`**`/`*` emphasis into nested nodes, preferring (at the same
/// position) the longest delimiter — mirroring the renderer's pass order so
/// `***x***` becomes strong(emphasis(x)).
static func parseEmphasis(_ text: String) -> [MarkdownInline] {
guard !text.isEmpty else { return [] }
let patterns: [(pattern: String, wrap: ([MarkdownInline]) -> MarkdownInline)] = [
("\\*\\*\\*(.+?)\\*\\*\\*", { .strong([.emphasis($0)]) }),
("\\*\\*(.+?)\\*\\*", { .strong($0) }),
("\\*(.+?)\\*", { .emphasis($0) }),
]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Compiling regular expressions is relatively expensive. Since parseEmphasis is called recursively on inline runs, compiling the three regular expressions on every single call introduces unnecessary CPU overhead. We should compile them once as static properties of the parser.

    private static let strongEmphasisRegex = try! NSRegularExpression(pattern: "\\*\\*\\*(.+?)\\*\\*\\*")
    private static let strongRegex = try! NSRegularExpression(pattern: "\\*\\*(.+?)\\*\\*")
    private static let emphasisRegex = try! NSRegularExpression(pattern: "\\*(.+?)\\*")

    /// Parses `***`/`**`/`*` emphasis into nested nodes, preferring (at the same
    /// position) the longest delimiter — mirroring the renderer's pass order so
    /// `***x***` becomes strong(emphasis(x)).
    static func parseEmphasis(_ text: String) -> [MarkdownInline] {
        guard !text.isEmpty else { return [] }
        let patterns: [(regex: NSRegularExpression, wrap: ([MarkdownInline]) -> MarkdownInline)] = [
            (strongEmphasisRegex, { .strong([.emphasis($0)]) }),
            (strongRegex, { .strong($0) }),
            (emphasisRegex, { .emphasis($0) }),
        ]

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc33c18899

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

filename += ".\(ext)"
}
mediaExtensions.insert(ext)
media.append((filename, data))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reuse media parts for repeated image references

When a result contains two inline references to the same embedded image basename (for example, a DOCX that reuses the same logo; WordConverter keeps both Markdown refs but de-duplicates the .image section), imageRun appends the same filename to media for every occurrence. The later package write then adds word/media/<filename> multiple times, which either fails on duplicate ZIP entries or creates an invalid OOXML package with duplicate part names. Cache emitted media by source/filename and reuse it instead of appending a new part per run.

Useful? React with 👍 / 👎.

external: false
))
let docPrID = media.count
let name = OOXMLPackageWriter.escapeAttribute(source.isEmpty ? "image" : source)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve Markdown image alt text in DOCX drawings

When Markdown contains meaningful alt text such as ![Sales chart](chart.png) with matching image data, this derives the drawing name from the source filename and emits no descr; WordConverter.imageAltText reads descr first and then name, so re-importing the generated DOCX returns the filename as the alt text instead of Sales chart. Keep the filename for the media target, but write the escaped alt into the drawing alt-text attributes (at least wp:docPr descr).

Useful? React with 👍 / 👎.

Comment on lines +70 to +73
static func escape(_ text: String) -> String {
text.replacingOccurrences(of: "&", with: "&amp;")
.replacingOccurrences(of: "<", with: "&lt;")
.replacingOccurrences(of: ">", with: "&gt;")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Filter XML-invalid scalars before writing OOXML text

For write(markdown:) input containing XML 1.0-invalid control scalars (for example NUL or vertical tab), this helper only escapes markup characters and leaves those scalars in document.xml, worksheet XML, or slide XML. The ZIP can still be produced, but the XML parts are not well-formed and Office/readers will reject the output; because raw Markdown writes do not run sanitizeUnicode, the OOXML escaping path should drop or replace disallowed XML characters before serialization.

Useful? React with 👍 / 👎.

private static func parseLinkOrImage(_ chars: [Character], from: Int, isImage: Bool) -> (node: MarkdownInline, next: Int)? {
let bracket = isImage ? from + 1 : from
guard bracket < chars.count, chars[bracket] == "[" else { return nil }
guard let labelEnd = firstIndex(of: "]", in: chars, from: bracket + 1) else { return nil }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor escaped label delimiters when parsing links

Canonical Markdown from WordConverter escapes [/] in link labels and image alt text, but this delimiter search treats an escaped \] as the end of the label. When transcoding a DOCX link or image whose visible text contains ], the parser fails to recognize the link/image and the exporter writes literal Markdown instead of the original object; skip backslash-escaped delimiters and unescape the label before building the inline node.

Useful? React with 👍 / 👎.

Comment on lines +120 to +121
guard let parenClose = firstIndex(of: ")", in: chars, from: destStart) else { return nil }
dest = String(chars[destStart..<parenClose])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle balanced parentheses in link destinations

For raw Markdown passed to write(markdown:) with a common URL such as [spec](https://example.com/Foo_(bar)), this stops the destination at the first ), so the generated DOCX relationship target is truncated and the remaining ) is emitted as ordinary text. Since this API is advertised for LLM Markdown output, parse balanced parentheses (or normalize to angle-bracket destinations) before creating office hyperlinks.

Useful? React with 👍 / 👎.


case .list(let ordered, let items):
usedNumbering = true
let numId = ordered ? 2 : 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restart each ordered list with distinct numbering

When a Markdown document has two separate ordered lists with prose between them, every ordered list item is assigned the same numId and Word treats the second list as a continuation of the first instead of restarting at 1. Allocate a fresh numbering instance (or a start override) per ordered-list block so exported DOCX files preserve Markdown list boundaries in Word/Pages.

Useful? React with 👍 / 👎.

Comment thread Sources/PicoDocs/PicoDocs.swift Outdated
Comment on lines +109 to +110
if result.markdown().trimmingCharacters(in: .whitespacesAndNewlines).isEmpty,
!result.sections.contains(where: { $0.kind == .image }) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Render image-only results instead of blank files

This exception lets a ConverterResult that contains only .image sections pass validation, but the built-in exporters build their visible content from result.markdown() (which excludes .image carriers) and only embed images that are referenced inline. Passing an image-only result therefore produces a blank DOCX/PPTX/RTF or placeholder XLSX instead of the advertised valid image output; either synthesize image references for these sections or reject image-only results for formats that cannot render them.

Useful? React with 👍 / 👎.

- OOXMLPackageWriter.escape: drop XML 1.0-invalid control scalars so raw
  Markdown (which skips sanitizeUnicode) can't produce non-well-formed parts.
- WordprocessingMLExporter: look images up by basename when the source has a
  path prefix; package/relate each media file once (no duplicate ZIP parts on
  repeated refs); write Markdown alt text into wp:docPr@descr so it round-trips;
  give each ordered list its own numbering instance so Word restarts it at 1.
- XLSXExporter.sanitizeSheetName: rebuild via UnicodeScalarView so multi-scalar
  grapheme clusters (emoji) survive.
- MarkdownInline: pre-compile emphasis regexes; honor backslash-escaped label
  delimiters and balanced parens in bare link/image destinations, unescaping
  the recovered label/destination.
- MarkdownInline: store only the compiled emphasis regexes in static state
  (nonisolated(unsafe), immutable/thread-safe) and keep the non-Sendable wrap
  closures local to parseEmphasis — fixes the Swift 6 MutableGlobalVariable build error.
- PicoDocsEngine.write: for image-only results (only .image carriers, empty
  markdown), synthesize an inline ![alt](name) reference per carrier so the
  images are embedded instead of producing a blank document.
- Tests: image-only DOCX embeds its image; parser handles balanced parens and
  escaped label delimiters.

Copy link
Copy Markdown
Contributor Author

Thanks @gemini-code-assist and @chatgpt-codex-connector — worked through all the review threads. Pushed in be8f6a1 and 66d1511. Summary of each:

Fixed

  • XML 1.0-invalid control chars (Gemini HIGH + Codex P2, OOXMLPackageWriter.escape): now filters disallowed scalars (NUL, vertical tab, etc.) before escaping, so raw write(markdown:) input — which skips sanitizeUnicode — can't produce non-well-formed document.xml/worksheet/slide parts.
  • Image lookup by basename (Gemini, WordprocessingMLExporter): falls back to the source's last path component when it carries a path prefix.
  • Sheet-name grapheme clusters (Gemini, XLSXExporter): rebuilds via UnicodeScalarView so emoji / multi-scalar clusters survive sanitization.
  • Pre-compiled emphasis regexes (Gemini, MarkdownInline): the three matchers are compiled once instead of per recursive call.
  • Duplicate media parts (Codex P2): repeated references to the same image now reuse a single media part + relationship (unique drawing ids), avoiding duplicate ZIP entries.
  • Alt text in DOCX drawings (Codex P2): Markdown alt text is written to wp:docPr/@descr, which WordConverter.imageAltText reads first, so it round-trips instead of collapsing to the filename.
  • Escaped label delimiters (Codex P2, MarkdownInline): the label scan skips backslash-escaped \] and unescapes the recovered label.
  • Balanced parens in destinations (Codex P2): bare link/image destinations match balanced parentheses, so https://e.com/Foo_(bar) isn't truncated.
  • Ordered-list restart (Codex P2): each ordered list gets its own numbering instance with a startOverride, so Word restarts separate lists at 1 instead of continuing.
  • Image-only results render blank (Codex P2, PicoDocs.write): image-only ConverterResults now synthesize an inline [alt](name) reference per .image carrier so the bytes are actually embedded, keeping the "image-only doc is valid output" contract honest.

Added tests covering image-only export, balanced-paren links, and escaped label delimiters. The Swift 6 build error from the first push (a non-Sendable closure in the static regex table) is fixed in 66d1511 by keeping only the compiled regexes in static storage.


Generated by Claude Code

@gemini-code-assist

Copy link
Copy Markdown

Thanks for the detailed summary of the fixes, @ronaldmannak. The approach to sanitizing XML scalars in OOXMLPackageWriter and the extraction of the MarkdownBlockParser and MarkdownInlineParser into shared IRs are excellent improvements for maintainability and correctness. The added tests for image-only exports and edge cases like balanced parentheses in links are also great to see. Everything looks solid and ready for review.

@chatgpt-codex-connector

Copy link
Copy Markdown

Summary

Testing

  • ⚠️ swift test — could not complete because SwiftPM dependency fetches were blocked by the environment’s network/proxy (CONNECT tunnel failed, response 403).
  • git status --short — clean working tree; no new changes to commit.

View task →

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66d1511694

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +62 to +63
case .heading(_, let text):
rows.append([plain(text)])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Skip sheet-title headings when rebuilding XLSX rows

When writing a ConverterResult produced by SpreadsheetConverter, each .sheet section's Markdown starts with ## <sheetName> (see SpreadsheetConverter.markdownTable), while the sheet name is already carried separately in section.sheetName. Treating every heading as a row here means transcode(..., to: .xlsx) inserts the worksheet name into cell A1 and shifts the original spreadsheet data down by one row, corrupting round-tripped XLSX content. Skip the section-title heading for sheet sections or prefer a lossless row source for spreadsheet-derived sections.

Useful? React with 👍 / 👎.

Comment thread Sources/PicoDocs/PicoDocs.swift Outdated
Comment on lines +130 to +131
let name = (section.sourcePath as NSString?)?.lastPathComponent ?? section.title
guard let name, !name.isEmpty else { return nil }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Generate a fallback name for unnamed image carriers

The new synthesis path still drops any .image carrier that lacks both sourcePath and title, even though those fields are optional and write(_:) treats any image section as valid non-empty input. In that case an image-only result with valid metadata["base64"]/metadata["mimeType"] passes validation, refs stays empty, and the exporters receive no inline reference to the bytes, so the call succeeds with a blank/placeholder document instead of rendering the image; synthesize a stable filename from the index and MIME type instead of returning nil.

Useful? React with 👍 / 👎.

/// reports `false` and `write(...)` throws `unableToExportToRequestedFormat`.
public var isImplemented: Bool {
switch self {
case .docx, .rtf, .xlsx, .pptx: return true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make RTF support reflect the current platform

On non-Apple platforms the default registry does not register AttributedStringRTFExporter because it is inside #if canImport(AppKit) || canImport(UIKit), so .rtf writes throw unableToExportToRequestedFormat; however isImplemented still reports .rtf as supported. Server/Linux callers that use this property to advertise available export formats will offer an option that the built-in writer cannot fulfill, so gate .rtf the same way the registry does or document a separate availability API.

Useful? React with 👍 / 👎.

Comment on lines +52 to +54
let name = (section.sourcePath as NSString?)?.lastPathComponent ?? section.title
guard let name, !name.isEmpty else { continue }
index[name] = data

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve path-qualified image identities in DOCX export

When callers provide a structured result with two image sections such as charts/logo.png and headers/logo.png and matching inline refs, this index collapses both carriers to the same basename key and the later section overwrites the earlier bytes. Both drawings then resolve to the same media part even though imageRun tries to support path-prefixed sources, so distinct images with equal basenames are rendered incorrectly; index by source path as well as basename, and only fall back to basename when it is unambiguous.

Useful? React with 👍 / 👎.

Comment on lines +209 to +211
if ext.isEmpty {
ext = OfficeMediaType.fileExtension(forMIME: OfficeMediaType.mimeType(forExtension: ext))
filename += ".\(ext)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use image MIME metadata when adding missing extensions

For an embedded image whose reference/name has no extension but whose .image section carries metadata["mimeType"] (for example image/png), this derives the fallback extension by first asking for the MIME type of an empty extension, which always becomes application/octet-stream, and then maps that to .bin. The resulting DOCX stores word/media/<name>.bin with an octet-stream content type, so Office cannot reliably recognize or render otherwise valid PNG/JPEG bytes; carry the section MIME through the image index and use it here when ext.isEmpty.

Useful? React with 👍 / 👎.

- XLSXExporter.rows: drop a leading heading that echoes the section's
  sheetName/title so SpreadsheetConverter-derived sheets (which prefix `## <name>`)
  don't push the worksheet name into A1 and shift data down a row.
- ExportableFileType.isImplemented: gate `.rtf` behind canImport(AppKit)||canImport(UIKit)
  so it doesn't advertise an export the (Apple-only) RTF writer can't fulfill on Linux.
- WordprocessingMLExporter image index: rebuild as a struct carrying a unique,
  correctly-typed media filename per carrier; key by full source path and by
  unambiguous basename so same-basename images from different paths stay distinct,
  and derive missing extensions from the carrier MIME (no more .bin/octet-stream).
- PicoDocsEngine.withSynthesizedImageReferences: generate a stable image-<n>.<ext>
  name (ext from MIME) for unnamed carriers and assign it as sourcePath so the
  exporter index resolves it — image-only results with MIME-only carriers now embed.
- Tests: XLSX sheet-title not a data row; same-basename images stay distinct;
  unnamed MIME-only image embeds as .png; .rtf availability matches platform.

Copy link
Copy Markdown
Contributor Author

Second review round (@chatgpt-codex-connector) addressed in 4676e9b — all 5 P2 items confirmed as real correctness gaps and fixed:

  • XLSX sheet-title becomes a data row (XLSXExporter.rows): SpreadsheetConverter prefixes each sheet with ## <sheetName> and carries the name in sheetName but sets no csv metadata, so the flatten path put the title in A1 and shifted data down. Now a leading heading that echoes sheetName/title is dropped (genuine in-body headings stay).
  • .rtf over-reports availability (ExportableFileType.isImplemented): the RTF writer is registered only under #if canImport(AppKit) || canImport(UIKit), so .rtf is now gated the same way and Linux callers no longer advertise an export that throws.
  • Same-basename images collapse (WordprocessingMLExporter image index): rebuilt as a struct assigning each carrier a unique, correctly-typed media filename; keyed by full source path and by unambiguous basename, so charts/logo.png and headers/logo.png write two distinct parts (logo.png, logo-2.png).
  • Wrong extension for MIME-only images: the missing extension is now taken from the carrier's MIME (image/png.png) at index-build time, so valid bytes are no longer stored as .bin/octet-stream.
  • Unnamed image carriers dropped (withSynthesizedImageReferences): a carrier with only base64/mimeType now gets a stable image-<n>.<ext> name (ext from MIME), assigned as its sourcePath so the exporter index resolves the same key — image-only results with such carriers embed instead of producing a blank document.

Tests added: XLSX sheet title isn't a data cell; same-basename images stay distinct; an unnamed MIME-only image embeds as .png; .rtf availability matches the platform gate.

Build (macOS, Swift 6) is running on 4676e9b; I'll follow up if it's not green.


Generated by Claude Code

@chatgpt-codex-connector

Copy link
Copy Markdown

Summary

Testing

  • ⚠️ swift test — could not complete because SwiftPM dependency fetches were blocked by the environment proxy (CONNECT tunnel failed, response 403).
  • git status --short — clean working tree; no new changes to commit.

View task →

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d8f9504801

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +135 to +145
var name = (section.sourcePath as NSString?)?.lastPathComponent ?? section.title
if name?.isEmpty ?? true {
generatedCount += 1
let ext = OfficeMediaType.fileExtension(forMIME: section.metadata["mimeType"] ?? "")
let generated = "image-\(generatedCount).\(ext)"
sections[index].sourcePath = generated
name = generated
}
guard let name, !name.isEmpty else { continue }
let alt = section.title ?? name
refs.append(DocumentSection(kind: .body, markdown: "![\(alt)](\(name))"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve paths in synthesized image refs

When an image-only result contains two carriers with the same basename under different sourcePaths, this synthesis strips both references down to that basename (for example charts/logo.png and headers/logo.png both become ![logo.png](logo.png)). The DOCX image index intentionally drops ambiguous basename lookups and only keeps the full paths, so these synthesized references no longer resolve and the exported DOCX falls back to text instead of embedding either image; the fresh evidence is this new synthesis path using lastPathComponent before creating the Markdown ref.

Useful? React with 👍 / 👎.

Comment on lines +71 to +72
case .table(let tableRows):
rows += tableRows

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Strip inline Markdown from XLSX table cells

For table rows this appends the parsed cell text directly, unlike headings, paragraphs, and list items which go through plain(_). When transcoding a DOCX/Markdown table cell with inline formatting such as **Total** or a link label, the generated XLSX cell contains the Markdown syntax instead of the visible cell value, corrupting spreadsheet data; map table cells through the same plain-text projection before writing rows.

Useful? React with 👍 / 👎.

Comment on lines +63 to +66
if case .heading(_, let text)? = blocks.first,
let title = section.sheetName ?? section.title,
plain(text) == title {
blocks.removeFirst()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Limit sheet-title stripping to sheet sections

This removes any leading heading whose text matches section.sheetName ?? section.title, even for non-spreadsheet sections. A normal titled body/chapter such as DocumentSection(title: "Intro", kind: .body, markdown: "# Intro\n\n...") exported to XLSX loses the Intro row, while only .sheet sections from SpreadsheetConverter have the redundant sheet-title heading that needs dropping.

Useful? React with 👍 / 👎.

Comment on lines +70 to +72
if case .heading(let level, let text) = block, level <= 2 {
flush()
title = text

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize PPTX heading titles before emitting

For Markdown split into slides, the slide title is taken directly from the heading text and later XML-escaped only, while body lines go through plain(_). A heading like # **Q4** results or # [Spec](https://example.com) therefore produces a PPTX title showing literal Markdown syntax instead of the visible title text; run the heading through the same inline plain-text projection or emit rich text runs.

Useful? React with 👍 / 👎.

Comment on lines +126 to +127
case .footnoteReference:
break // references carry no inline glyph in this projection

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve footnote markers in RTF exports

On Apple platforms the default RTF writer goes through this builder, and any Markdown/converted DOCX containing [^id] markers loses them here. The body reference is dropped entirely, and a footnote definition like [^fn1]: note becomes just : note, so the exported RTF no longer preserves where the note was referenced or which definition it belongs to; emit a literal marker as the DOCX writer does instead of skipping it.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants