Add office file exporters (DOCX/XLSX/PPTX/RTF) for Markdown output#34
Add office file exporters (DOCX/XLSX/PPTX/RTF) for Markdown output#34ronaldmannak wants to merge 5 commits into
Conversation
PicoDocs was one-directional (office files -> Markdown). This adds the
"exporter" half of the engine: turn LLM Markdown output or a structured
ConverterResult into DOCX, XLSX, PPTX, or RTF.
Architecture mirrors the import side:
- DocumentExporter protocol + DocumentExporterRegistry (value type,
priority-ordered, .registering chaining) parallel DocumentConverter/Registry.
- PicoDocsEngine.write(_:to:) / write(markdown:to:) / transcode(...) parallel
convert/export, returning Data. Empty input throws .emptyDocument (image-only
results are exempt); unimplemented formats throw .unableToExportToRequestedFormat.
- New ExportableFileType (binary office outputs), kept separate from the text
ExportFileType.
Shared intermediate representation:
- Extracted the renderer's private Block parser into internal MarkdownBlock /
MarkdownBlockParser so both halves use one parser (behavior-preserving;
DocumentRenderer repointed at it).
- Added a structured MarkdownInline IR for exporters that need run structure.
Per-format writers:
- WordprocessingMLExporter: primary, all-platform DOCX writer (hand-rolled OOXML),
the round-trip oracle; mirrors WordConverter's markers (Heading{N} styles,
w:numPr lists, w:b/w:i, w:hyperlink, a:blip image drawings).
- XLSXExporter / PPTXExporter: hand-rolled OOXML on a shared OOXMLPackageWriter
(ZIPFoundation write).
- AttributedStringRTFExporter (Apple) builds NSAttributedString from the IR (no
WebKit HTML import, so it stays thread-agnostic); AttributedStringDOCXExporter is
opt-in/macOS-only and NOT registered by default.
- Pages/Keynote are a documented spike only (no writer).
- Lifted WordConverter.mimeType(forExtension:) into shared OfficeMediaType.
Tests: round-trip DOCX/XLSX (via WordConverter/SpreadsheetConverter), RTF (Apple,
via RTFConverter), structural PPTX, inline-IR unit tests, and the empty/unsupported
contract.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01MRakJBZVBQ4CcTdcypk9Z6
There was a problem hiding this comment.
Code Review
This pull request introduces binary office document exporting capabilities (DOCX, XLSX, PPTX, and RTF) to PicoDocs, sharing a unified block and inline Markdown intermediate representation between converters and exporters. Feedback on the changes highlights several key improvement opportunities: sanitizing invalid XML 1.0 control characters to prevent corrupted office files, making image lookups robust against path prefixes, preserving multi-scalar grapheme clusters (such as emojis) during sheet name sanitization, and pre-compiling regular expressions in the inline parser to avoid recursive overhead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| static func escape(_ text: String) -> String { | ||
| text.replacingOccurrences(of: "&", with: "&") | ||
| .replacingOccurrences(of: "<", with: "<") | ||
| .replacingOccurrences(of: ">", with: ">") | ||
| } |
There was a problem hiding this comment.
XML 1.0 documents do not allow certain control characters (e.g., 0x00 to 0x08, 0x0B, 0x0C, 0x0E to 0x1F except tab, newline, and carriage return). If these characters are present in the text (for example, extracted from PDFs or generated by LLMs), writing them directly into the XML will result in corrupted office files that fail to open in Microsoft Office. We should sanitize the input string by filtering out these invalid XML characters.
| static func escape(_ text: String) -> String { | |
| text.replacingOccurrences(of: "&", with: "&") | |
| .replacingOccurrences(of: "<", with: "<") | |
| .replacingOccurrences(of: ">", with: ">") | |
| } | |
| static func escape(_ text: String) -> String { | |
| let sanitized = String(text.unicodeScalars.filter { scalar in | |
| let val = scalar.value | |
| return val == 0x9 || val == 0xA || val == 0xD || | |
| (val >= 0x20 && val <= 0xD7FF) || | |
| (val >= 0xE000 && val <= 0xFFFD) || | |
| (val >= 0x10000 && val <= 0x10FFFF) | |
| }) | |
| return sanitized.replacingOccurrences(of: "&", with: "&") | |
| .replacingOccurrences(of: "<", with: "<") | |
| .replacingOccurrences(of: ">", with: ">") | |
| } |
| private func imageRun(alt: String, source: String) -> String? { | ||
| guard let data = images[source] else { return nil } |
There was a problem hiding this comment.
If the image source in the Markdown contains a path (e.g., word/media/pic.png) rather than just the basename, looking up images[source] will fail because the images index is keyed by the last path component (basename). To make the lookup robust, we should fall back to checking the last path component of the source string.
| private func imageRun(alt: String, source: String) -> String? { | |
| guard let data = images[source] else { return nil } | |
| private func imageRun(alt: String, source: String) -> String? { | |
| let key = images[source] != nil ? source : (source as NSString).lastPathComponent | |
| guard let data = images[key] else { return nil } |
| private static func sanitizeSheetName(_ name: String) -> String { | ||
| let invalid = CharacterSet(charactersIn: ":\\/?*[]") | ||
| let cleaned = String(name.unicodeScalars.map { invalid.contains($0) ? Character(" ") : Character($0) }) | ||
| .trimmingCharacters(in: .whitespacesAndNewlines) | ||
| return String(cleaned.prefix(31)) | ||
| } |
There was a problem hiding this comment.
Mapping each UnicodeScalar individually to a Character and then joining them breaks multi-scalar grapheme clusters (such as flag emojis, skin tone modifiers, or family emojis) into separate characters. To preserve grapheme clusters, we should map the scalars to a new UnicodeScalarView first, and then convert that view to a String.
| private static func sanitizeSheetName(_ name: String) -> String { | |
| let invalid = CharacterSet(charactersIn: ":\\/?*[]") | |
| let cleaned = String(name.unicodeScalars.map { invalid.contains($0) ? Character(" ") : Character($0) }) | |
| .trimmingCharacters(in: .whitespacesAndNewlines) | |
| return String(cleaned.prefix(31)) | |
| } | |
| private static func sanitizeSheetName(_ name: String) -> String { | |
| let invalid = CharacterSet(charactersIn: ":\\/?*[]") | |
| let cleanedScalars = name.unicodeScalars.map { invalid.contains($0) ? UnicodeScalar(32) : $0 } | |
| let cleaned = String(String.UnicodeScalarView(cleanedScalars)) | |
| .trimmingCharacters(in: .whitespacesAndNewlines) | |
| return String(cleaned.prefix(31)) | |
| } |
|
|
||
| /// Parses `***`/`**`/`*` emphasis into nested nodes, preferring (at the same | ||
| /// position) the longest delimiter — mirroring the renderer's pass order so | ||
| /// `***x***` becomes strong(emphasis(x)). | ||
| static func parseEmphasis(_ text: String) -> [MarkdownInline] { | ||
| guard !text.isEmpty else { return [] } | ||
| let patterns: [(pattern: String, wrap: ([MarkdownInline]) -> MarkdownInline)] = [ | ||
| ("\\*\\*\\*(.+?)\\*\\*\\*", { .strong([.emphasis($0)]) }), | ||
| ("\\*\\*(.+?)\\*\\*", { .strong($0) }), | ||
| ("\\*(.+?)\\*", { .emphasis($0) }), | ||
| ] |
There was a problem hiding this comment.
Compiling regular expressions is relatively expensive. Since parseEmphasis is called recursively on inline runs, compiling the three regular expressions on every single call introduces unnecessary CPU overhead. We should compile them once as static properties of the parser.
private static let strongEmphasisRegex = try! NSRegularExpression(pattern: "\\*\\*\\*(.+?)\\*\\*\\*")
private static let strongRegex = try! NSRegularExpression(pattern: "\\*\\*(.+?)\\*\\*")
private static let emphasisRegex = try! NSRegularExpression(pattern: "\\*(.+?)\\*")
/// Parses `***`/`**`/`*` emphasis into nested nodes, preferring (at the same
/// position) the longest delimiter — mirroring the renderer's pass order so
/// `***x***` becomes strong(emphasis(x)).
static func parseEmphasis(_ text: String) -> [MarkdownInline] {
guard !text.isEmpty else { return [] }
let patterns: [(regex: NSRegularExpression, wrap: ([MarkdownInline]) -> MarkdownInline)] = [
(strongEmphasisRegex, { .strong([.emphasis($0)]) }),
(strongRegex, { .strong($0) }),
(emphasisRegex, { .emphasis($0) }),
]There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc33c18899
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| filename += ".\(ext)" | ||
| } | ||
| mediaExtensions.insert(ext) | ||
| media.append((filename, data)) |
There was a problem hiding this comment.
Reuse media parts for repeated image references
When a result contains two inline references to the same embedded image basename (for example, a DOCX that reuses the same logo; WordConverter keeps both Markdown refs but de-duplicates the .image section), imageRun appends the same filename to media for every occurrence. The later package write then adds word/media/<filename> multiple times, which either fails on duplicate ZIP entries or creates an invalid OOXML package with duplicate part names. Cache emitted media by source/filename and reuse it instead of appending a new part per run.
Useful? React with 👍 / 👎.
| external: false | ||
| )) | ||
| let docPrID = media.count | ||
| let name = OOXMLPackageWriter.escapeAttribute(source.isEmpty ? "image" : source) |
There was a problem hiding this comment.
Preserve Markdown image alt text in DOCX drawings
When Markdown contains meaningful alt text such as  with matching image data, this derives the drawing name from the source filename and emits no descr; WordConverter.imageAltText reads descr first and then name, so re-importing the generated DOCX returns the filename as the alt text instead of Sales chart. Keep the filename for the media target, but write the escaped alt into the drawing alt-text attributes (at least wp:docPr descr).
Useful? React with 👍 / 👎.
| static func escape(_ text: String) -> String { | ||
| text.replacingOccurrences(of: "&", with: "&") | ||
| .replacingOccurrences(of: "<", with: "<") | ||
| .replacingOccurrences(of: ">", with: ">") |
There was a problem hiding this comment.
Filter XML-invalid scalars before writing OOXML text
For write(markdown:) input containing XML 1.0-invalid control scalars (for example NUL or vertical tab), this helper only escapes markup characters and leaves those scalars in document.xml, worksheet XML, or slide XML. The ZIP can still be produced, but the XML parts are not well-formed and Office/readers will reject the output; because raw Markdown writes do not run sanitizeUnicode, the OOXML escaping path should drop or replace disallowed XML characters before serialization.
Useful? React with 👍 / 👎.
| private static func parseLinkOrImage(_ chars: [Character], from: Int, isImage: Bool) -> (node: MarkdownInline, next: Int)? { | ||
| let bracket = isImage ? from + 1 : from | ||
| guard bracket < chars.count, chars[bracket] == "[" else { return nil } | ||
| guard let labelEnd = firstIndex(of: "]", in: chars, from: bracket + 1) else { return nil } |
There was a problem hiding this comment.
Honor escaped label delimiters when parsing links
Canonical Markdown from WordConverter escapes [/] in link labels and image alt text, but this delimiter search treats an escaped \] as the end of the label. When transcoding a DOCX link or image whose visible text contains ], the parser fails to recognize the link/image and the exporter writes literal Markdown instead of the original object; skip backslash-escaped delimiters and unescape the label before building the inline node.
Useful? React with 👍 / 👎.
| guard let parenClose = firstIndex(of: ")", in: chars, from: destStart) else { return nil } | ||
| dest = String(chars[destStart..<parenClose]) |
There was a problem hiding this comment.
Handle balanced parentheses in link destinations
For raw Markdown passed to write(markdown:) with a common URL such as [spec](https://example.com/Foo_(bar)), this stops the destination at the first ), so the generated DOCX relationship target is truncated and the remaining ) is emitted as ordinary text. Since this API is advertised for LLM Markdown output, parse balanced parentheses (or normalize to angle-bracket destinations) before creating office hyperlinks.
Useful? React with 👍 / 👎.
|
|
||
| case .list(let ordered, let items): | ||
| usedNumbering = true | ||
| let numId = ordered ? 2 : 1 |
There was a problem hiding this comment.
Restart each ordered list with distinct numbering
When a Markdown document has two separate ordered lists with prose between them, every ordered list item is assigned the same numId and Word treats the second list as a continuation of the first instead of restarting at 1. Allocate a fresh numbering instance (or a start override) per ordered-list block so exported DOCX files preserve Markdown list boundaries in Word/Pages.
Useful? React with 👍 / 👎.
| if result.markdown().trimmingCharacters(in: .whitespacesAndNewlines).isEmpty, | ||
| !result.sections.contains(where: { $0.kind == .image }) { |
There was a problem hiding this comment.
Render image-only results instead of blank files
This exception lets a ConverterResult that contains only .image sections pass validation, but the built-in exporters build their visible content from result.markdown() (which excludes .image carriers) and only embed images that are referenced inline. Passing an image-only result therefore produces a blank DOCX/PPTX/RTF or placeholder XLSX instead of the advertised valid image output; either synthesize image references for these sections or reject image-only results for formats that cannot render them.
Useful? React with 👍 / 👎.
- OOXMLPackageWriter.escape: drop XML 1.0-invalid control scalars so raw Markdown (which skips sanitizeUnicode) can't produce non-well-formed parts. - WordprocessingMLExporter: look images up by basename when the source has a path prefix; package/relate each media file once (no duplicate ZIP parts on repeated refs); write Markdown alt text into wp:docPr@descr so it round-trips; give each ordered list its own numbering instance so Word restarts it at 1. - XLSXExporter.sanitizeSheetName: rebuild via UnicodeScalarView so multi-scalar grapheme clusters (emoji) survive. - MarkdownInline: pre-compile emphasis regexes; honor backslash-escaped label delimiters and balanced parens in bare link/image destinations, unescaping the recovered label/destination.
- MarkdownInline: store only the compiled emphasis regexes in static state (nonisolated(unsafe), immutable/thread-safe) and keep the non-Sendable wrap closures local to parseEmphasis — fixes the Swift 6 MutableGlobalVariable build error. - PicoDocsEngine.write: for image-only results (only .image carriers, empty markdown), synthesize an inline  reference per carrier so the images are embedded instead of producing a blank document. - Tests: image-only DOCX embeds its image; parser handles balanced parens and escaped label delimiters.
|
Thanks @gemini-code-assist and @chatgpt-codex-connector — worked through all the review threads. Pushed in Fixed
Added tests covering image-only export, balanced-paren links, and escaped label delimiters. The Swift 6 build error from the first push (a non- Generated by Claude Code |
|
Thanks for the detailed summary of the fixes, @ronaldmannak. The approach to sanitizing XML scalars in |
Summary
Testing
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 66d1511694
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| case .heading(_, let text): | ||
| rows.append([plain(text)]) |
There was a problem hiding this comment.
Skip sheet-title headings when rebuilding XLSX rows
When writing a ConverterResult produced by SpreadsheetConverter, each .sheet section's Markdown starts with ## <sheetName> (see SpreadsheetConverter.markdownTable), while the sheet name is already carried separately in section.sheetName. Treating every heading as a row here means transcode(..., to: .xlsx) inserts the worksheet name into cell A1 and shifts the original spreadsheet data down by one row, corrupting round-tripped XLSX content. Skip the section-title heading for sheet sections or prefer a lossless row source for spreadsheet-derived sections.
Useful? React with 👍 / 👎.
| let name = (section.sourcePath as NSString?)?.lastPathComponent ?? section.title | ||
| guard let name, !name.isEmpty else { return nil } |
There was a problem hiding this comment.
Generate a fallback name for unnamed image carriers
The new synthesis path still drops any .image carrier that lacks both sourcePath and title, even though those fields are optional and write(_:) treats any image section as valid non-empty input. In that case an image-only result with valid metadata["base64"]/metadata["mimeType"] passes validation, refs stays empty, and the exporters receive no inline reference to the bytes, so the call succeeds with a blank/placeholder document instead of rendering the image; synthesize a stable filename from the index and MIME type instead of returning nil.
Useful? React with 👍 / 👎.
| /// reports `false` and `write(...)` throws `unableToExportToRequestedFormat`. | ||
| public var isImplemented: Bool { | ||
| switch self { | ||
| case .docx, .rtf, .xlsx, .pptx: return true |
There was a problem hiding this comment.
Make RTF support reflect the current platform
On non-Apple platforms the default registry does not register AttributedStringRTFExporter because it is inside #if canImport(AppKit) || canImport(UIKit), so .rtf writes throw unableToExportToRequestedFormat; however isImplemented still reports .rtf as supported. Server/Linux callers that use this property to advertise available export formats will offer an option that the built-in writer cannot fulfill, so gate .rtf the same way the registry does or document a separate availability API.
Useful? React with 👍 / 👎.
| let name = (section.sourcePath as NSString?)?.lastPathComponent ?? section.title | ||
| guard let name, !name.isEmpty else { continue } | ||
| index[name] = data |
There was a problem hiding this comment.
Preserve path-qualified image identities in DOCX export
When callers provide a structured result with two image sections such as charts/logo.png and headers/logo.png and matching inline refs, this index collapses both carriers to the same basename key and the later section overwrites the earlier bytes. Both drawings then resolve to the same media part even though imageRun tries to support path-prefixed sources, so distinct images with equal basenames are rendered incorrectly; index by source path as well as basename, and only fall back to basename when it is unambiguous.
Useful? React with 👍 / 👎.
| if ext.isEmpty { | ||
| ext = OfficeMediaType.fileExtension(forMIME: OfficeMediaType.mimeType(forExtension: ext)) | ||
| filename += ".\(ext)" |
There was a problem hiding this comment.
Use image MIME metadata when adding missing extensions
For an embedded image whose reference/name has no extension but whose .image section carries metadata["mimeType"] (for example image/png), this derives the fallback extension by first asking for the MIME type of an empty extension, which always becomes application/octet-stream, and then maps that to .bin. The resulting DOCX stores word/media/<name>.bin with an octet-stream content type, so Office cannot reliably recognize or render otherwise valid PNG/JPEG bytes; carry the section MIME through the image index and use it here when ext.isEmpty.
Useful? React with 👍 / 👎.
- XLSXExporter.rows: drop a leading heading that echoes the section's sheetName/title so SpreadsheetConverter-derived sheets (which prefix `## <name>`) don't push the worksheet name into A1 and shift data down a row. - ExportableFileType.isImplemented: gate `.rtf` behind canImport(AppKit)||canImport(UIKit) so it doesn't advertise an export the (Apple-only) RTF writer can't fulfill on Linux. - WordprocessingMLExporter image index: rebuild as a struct carrying a unique, correctly-typed media filename per carrier; key by full source path and by unambiguous basename so same-basename images from different paths stay distinct, and derive missing extensions from the carrier MIME (no more .bin/octet-stream). - PicoDocsEngine.withSynthesizedImageReferences: generate a stable image-<n>.<ext> name (ext from MIME) for unnamed carriers and assign it as sourcePath so the exporter index resolves it — image-only results with MIME-only carriers now embed. - Tests: XLSX sheet-title not a data row; same-basename images stay distinct; unnamed MIME-only image embeds as .png; .rtf availability matches platform.
|
Second review round (@chatgpt-codex-connector) addressed in
Tests added: XLSX sheet title isn't a data cell; same-basename images stay distinct; an unnamed MIME-only image embeds as Build (macOS, Swift 6) is running on Generated by Claude Code |
Summary
Testing
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d8f9504801
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| var name = (section.sourcePath as NSString?)?.lastPathComponent ?? section.title | ||
| if name?.isEmpty ?? true { | ||
| generatedCount += 1 | ||
| let ext = OfficeMediaType.fileExtension(forMIME: section.metadata["mimeType"] ?? "") | ||
| let generated = "image-\(generatedCount).\(ext)" | ||
| sections[index].sourcePath = generated | ||
| name = generated | ||
| } | ||
| guard let name, !name.isEmpty else { continue } | ||
| let alt = section.title ?? name | ||
| refs.append(DocumentSection(kind: .body, markdown: ")")) |
There was a problem hiding this comment.
Preserve paths in synthesized image refs
When an image-only result contains two carriers with the same basename under different sourcePaths, this synthesis strips both references down to that basename (for example charts/logo.png and headers/logo.png both become ). The DOCX image index intentionally drops ambiguous basename lookups and only keeps the full paths, so these synthesized references no longer resolve and the exported DOCX falls back to text instead of embedding either image; the fresh evidence is this new synthesis path using lastPathComponent before creating the Markdown ref.
Useful? React with 👍 / 👎.
| case .table(let tableRows): | ||
| rows += tableRows |
There was a problem hiding this comment.
Strip inline Markdown from XLSX table cells
For table rows this appends the parsed cell text directly, unlike headings, paragraphs, and list items which go through plain(_). When transcoding a DOCX/Markdown table cell with inline formatting such as **Total** or a link label, the generated XLSX cell contains the Markdown syntax instead of the visible cell value, corrupting spreadsheet data; map table cells through the same plain-text projection before writing rows.
Useful? React with 👍 / 👎.
| if case .heading(_, let text)? = blocks.first, | ||
| let title = section.sheetName ?? section.title, | ||
| plain(text) == title { | ||
| blocks.removeFirst() |
There was a problem hiding this comment.
Limit sheet-title stripping to sheet sections
This removes any leading heading whose text matches section.sheetName ?? section.title, even for non-spreadsheet sections. A normal titled body/chapter such as DocumentSection(title: "Intro", kind: .body, markdown: "# Intro\n\n...") exported to XLSX loses the Intro row, while only .sheet sections from SpreadsheetConverter have the redundant sheet-title heading that needs dropping.
Useful? React with 👍 / 👎.
| if case .heading(let level, let text) = block, level <= 2 { | ||
| flush() | ||
| title = text |
There was a problem hiding this comment.
Normalize PPTX heading titles before emitting
For Markdown split into slides, the slide title is taken directly from the heading text and later XML-escaped only, while body lines go through plain(_). A heading like # **Q4** results or # [Spec](https://example.com) therefore produces a PPTX title showing literal Markdown syntax instead of the visible title text; run the heading through the same inline plain-text projection or emit rich text runs.
Useful? React with 👍 / 👎.
| case .footnoteReference: | ||
| break // references carry no inline glyph in this projection |
There was a problem hiding this comment.
Preserve footnote markers in RTF exports
On Apple platforms the default RTF writer goes through this builder, and any Markdown/converted DOCX containing [^id] markers loses them here. The body reference is dropped entirely, and a footnote definition like [^fn1]: note becomes just : note, so the exported RTF no longer preserves where the note was referenced or which definition it belongs to; emit a literal marker as the DOCX writer does instead of skipping it.
Useful? React with 👍 / 👎.
Implement the write-side of PicoDocs: convert Markdown and
ConverterResultinto binary office formats, completing the round-trip capability.Summary
This PR adds a complete export pipeline that mirrors the existing read-side converters. Users can now serialize LLM Markdown output or structured document results into DOCX, XLSX, PPTX, and RTF files via
PicoDocsEngine.write(...). The implementation includes hand-rolled OOXML writers for DOCX/XLSX/PPTX (ensuring round-trip fidelity with the existing readers), platform-native exporters for RTF, and shared Markdown intermediate representations used by both renderers and exporters.Key Changes
Shared Markdown IR: Extracted block-level (
MarkdownBlock) and inline (MarkdownInline) parsing fromDocumentRendererinto reusable enums (MarkdownBlockParser,MarkdownInlineParser). Both renderers and exporters now consume the same structured representation, ensuring consistency and simplifying future parser improvements.Hand-rolled OOXML exporters:
WordprocessingMLExporter: Writes minimal-but-valid DOCX files with headings (viaw:pStyle), emphasis (w:b/w:i), lists (w:numPr), tables, links (w:hyperlink), and embedded images. Deliberately mirrorsWordConverter's markers for round-trip validation.XLSXExporter: Generates SpreadsheetML with one sheet per document section. Uses losslessmetadata["csv"]when present; otherwise flattens Markdown to rows (matchingDocumentRenderer.renderCSV).PPTXExporter: Creates PresentationML with one slide per top-level heading or explicit.slidesection. Includes fixed template parts (slide master, layout, theme) viaPPTXTemplates.Platform-native exporters:
AttributedStringRTFExporter: BuildsNSAttributedStringfrom the Markdown IR and serializes to RTF (available across all Apple platforms).AttributedStringDOCXExporter: Optional macOS-only DOCX writer usingNSAttributedString.officeOpenXML; intentionally not registered by default (hand-rolled version is the primary).AttributedStringDocumentBuilder: Shared utility that constructs attributed strings fromConverterResult, supporting headings, emphasis, lists, code, blockquotes, and tables.Infrastructure:
DocumentExporterprotocol: Sendable, synchronous serializers (pure CPU, no I/O).DocumentExporterRegistry: Immutable, priority-ordered registry of exporters (mirrorsDocumentConverterRegistry).OOXMLPackageWriter: In-memory ZIP builder for OOXML files (no temp files; pureDataoutput).ExportableFileType: Enum for writable formats (DOCX, XLSX, PPTX, RTF, Pages, Keynote). Pages/Keynote are marked unimplemented with research notes explaining why third-party iWork writing is impractical.OfficeMediaType: Shared MIME type mapping for image extensions (used by both readers and writers).Engine API: Added
PicoDocsEngine.write(result:to:registry:)andPicoDocsEngine.write(markdown:to:)for serialization.Tests:
ExporterTestsvalidates round-trip (write → re-import with existing converters → compare recovered text) for DOCX and XLSX, and structural validation for PPTX (unzip, required parts, title text).Documentation: Updated README with export examples.
Notable Implementation Details
WordConverter/SpreadsheetConverterand comparing recovered Markdown, using the readers as ground truth.DocumentRenderer's inline helpers remain forhttps://claude.ai/code/session_01MRakJBZVBQ4CcTdcypk9Z6