Release/v0.1.2 #38
Merged
Major work for the v0.1.2 release branch. Consolidates 32 commits of round-trip fidelity fixes, IR enrichment, performance, and test coverage into a single release commit.

# Performance
- xlsx: O(1) cell-style lookups via HashMap (replaces linear Vec scan in the hot per-cell formatting path)

# Round-trip fidelity (PDF → office → PDF)
- DOCX/PPTX/XLSX: preserve images, fonts, columns end-to-end
- Alignment, spacing, footers, rules survive both directions
- PPTX ThematicBreak encoded as a 30-char U+2500 marker run that downstream PDF renderers detect and re-emit as a real <hr>
- DOCX <w:pBdr><w:bottom/> on an empty paragraph recovers as Element::ThematicBreak in IR

# DOCX
- Parse <w:framePr> into IR FramePosition (layout-preserving paths like pdf_oxide's to_docx_bytes_layout)
- Heading carries frame_position
- Parse floating <wp:anchor> drawings and <wps:wsp> vector shapes (line/rect with stroke/fill RGB)
- Preserve per-section page sizes; emit per-section <w:sectPr> on multi-section IR
- Preserve <w:sz> through to IR's font_size_half_pt
- Include header/footer text in to_markdown and to_ir
- Embedded fonts under /word/fonts/ are parsed and exposed on DocxDocument.embedded_fonts; strip_embedded_font_filename recovers the original face name from font_<n>_<face>.<ext> (fixes greedy alphabetic-trim regression)
- parse_drawing decomposed into focused recursive helpers
- Plumb paragraph alignment + inline image collection

# PPTX
- Real Title+Body slide layout instead of blank
- Paginate slides (~250 cap) + synthesize Slide N heading on to_ir
- Wrap shapes in positioned TextBox + parse slide background
- Don't wrap zero-size shape positions in TextBox
- Propagate slide size to per-section page_setup
- Preserve run font sizes (sz attribute → font_size_hundredths_pt)
- Parse paragraph algn + spcBef → IR alignment + space_before
- Picture shapes carry embed_rid + bytes + format resolved via pre-built media map
- PPTX font embedding under /ppt/fonts/
- Structured chart text extraction (parses <c:chart> nodes into per-chart text blocks rendered as ## Chart N in markdown)

# XLSX
- Per-worksheet page_setup round-trip via <pageMargins>/<pageSetup> with inch/mm/cm/paperSize parsing
- Preserve font sizes through IR; emit prose XLSX as paragraphs when a 1-column sheet has long-text cells
- Unique worksheet names in ir_to_xlsx
- New numfmt module: built-in IDs 0-44 (general, fixed, commas, percent, currency, scientific, accounting) + custom format strings (multi-section, [Red] color directive, currency prefix, quoted literal suffix, scale-by-thousand)
- Worksheet drawings: WorksheetPicture + WorksheetTextShape parsed from xl/drawings/, anchor coords in EMU
- Embedded fonts under /xl/fonts/

# IR enrichment
- New types: Shape, ShapeGeom, FramePosition, ParagraphAlignment variants (Distribute), block_default centralisation (ThematicBreak → "---" / "<hr />", PageBreak/Shape invisible in flow, TextBox recursively renders children)
- New helpers: first_inline_font_size_pt, inline_to_element_block, build_nested_list (flat / 2-level / 3-level nesting)
- Heading carries frame_position + alignment
- Section.background_rgb propagated from PPTX slide background

# Writers
- DOCX: wire fontTable, heading styles, embed fonts, core props, dedup runs
- PPTX: cap slides at ~250 (PowerPoint hard limit), autoFit, set_title_aligned
- XLSX: split long paragraphs across cells; unique sheet names

# Refactors
- core: unified font embedding helper + cross-format font-size invariant (HalfPoint::from_word_sz / from_drawingml_sz)
- ir: consolidate inline_to_element / build_nested_list / first_inline_font_size_pt (used by all 3 IR converters)
- ir_render: extract block_default to centralise no-flow defaults (compiler-enforced exhaustiveness on new Element variants)

# Tests
- 98 new unit tests across the touched modules (core, xlsx/numfmt, xlsx/worksheet, docx/formatting, docx/mod, pptx/slide, ir, ir_render). All in-module #[cfg(test)] blocks; no new integration files.
- Final state: 535/535 tests pass on default, --features parallel, --features mmap, --features parallel+mmap

# Cleanup
- cargo fmt clean
- cargo clippy --workspace --all-targets -- -D warnings clean
- 0 build warnings
- maturin build (python feature) and wasm-pack build (wasm feature) both produce working packages; Python smoke verifies Document / EditableDocument / XlsxWriter / PptxWriter / create_from_markdown all functional
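
The `font_<n>_<face>.<ext>` recovery mentioned under DOCX can be sketched as below. This is a minimal illustration only: the function body, the `Option<&str>` return type, and the choice to split at the first underscore after the index are assumptions, not the crate's actual implementation. The point is that splitting on fixed delimiters avoids any greedy trimming of alphabetic characters from the face name.

```rust
/// Hypothetical sketch: recover the face name from `font_<n>_<face>.<ext>`.
/// Returns `None` when the file name does not follow that pattern.
fn strip_embedded_font_filename(file_name: &str) -> Option<&str> {
    // Drop the extension at the last dot.
    let (stem, _ext) = file_name.rsplit_once('.')?;
    // Require the literal `font_` prefix.
    let rest = stem.strip_prefix("font_")?;
    // Split off the numeric index at the first underscore only, so
    // underscores, letters, and digits inside the face name are left alone.
    let (index, face) = rest.split_once('_')?;
    if index.is_empty() || face.is_empty() {
        return None;
    }
    if !index.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some(face)
}
```

Under these assumptions, `font_12_Noto_Sans.odttf` yields `Noto_Sans`, and a name without the prefix yields `None`.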
Bumps [actions/attest-sbom](https://github.com/actions/attest-sbom) from 2.4.0 to 4.1.0.
- [Release notes](https://github.com/actions/attest-sbom/releases)
- [Changelog](https://github.com/actions/attest-sbom/blob/main/RELEASE.md)
- [Commits](actions/attest-sbom@bd218ad...c604332)

---
updated-dependencies:
- dependency-name: actions/attest-sbom
  dependency-version: 4.1.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.35.2 to 4.35.3.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@ce64ddc...e46ed2c)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.35.3
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.6.2 to 7.0.1.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@ea165f8...043fb46)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: 7.0.1
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

…48a0b548edc03f92a220660cdb8

Updates the requirements on [dtolnay/rust-toolchain](https://github.com/dtolnay/rust-toolchain) to permit the latest version.
- [Release notes](https://github.com/dtolnay/rust-toolchain/releases)
- [Commits](https://github.com/dtolnay/rust-toolchain/commits/29eef336d9b2848a0b548edc03f92a220660cdb8)

---
updated-dependencies:
- dependency-name: dtolnay/rust-toolchain
  dependency-version: 29eef336d9b2848a0b548edc03f92a220660cdb8
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Bumps [actions/github-script](https://github.com/actions/github-script) from 7.0.1 to 9.0.0.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@60a0d83...3a2844b)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: 9.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Bumps [koffi](https://github.com/Koromix/koffi) from 2.16.1 to 2.16.2.
- [Commits](https://github.com/Koromix/koffi/commits)

---
updated-dependencies:
- dependency-name: koffi
  dependency-version: 2.16.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Bumps [quick-xml](https://github.com/tafia/quick-xml) from 0.37.5 to 0.40.0.
- [Release notes](https://github.com/tafia/quick-xml/releases)
- [Changelog](https://github.com/tafia/quick-xml/blob/master/Changelog.md)
- [Commits](tafia/quick-xml@v0.37.5...v0.40.0)

---
updated-dependencies:
- dependency-name: quick-xml
  dependency-version: 0.40.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Migrate the parsers to quick-xml 0.40 after the dependabot cherry-pick:

- `BytesText::unescape()` was removed in 0.40. Replace 6 call sites with a new `core::xml::unescape_text(BytesText) -> Result<String>` helper that does `decode()?` + `escape::unescape()?` in one call.
- `Attribute::unescape_value()` is deprecated in 0.40 (the replacement `normalized_value()` has different semantics: no entity unescaping). Wrap the 6 call sites through a new `core::xml::unescape_attr_value` helper with `#[allow(deprecated)]` localised to one place so the call sites stay deprecation-free.

Also apply `cargo fmt --all` (4 files: convert_docx, convert_xlsx, create, xlsx/text; pre-existing fmt drift surfaced by rebuild).

Result: 0 warnings, `cargo clippy --workspace --all-targets -- -D warnings` clean, 535/535 tests pass.
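
The "localise `#[allow(deprecated)]` to one place" pattern described above can be illustrated without the quick-xml dependency. In this sketch `legacy_unescape` is a hypothetical stand-in for a deprecated library method (it is not a quick-xml item), and the helper takes a plain `&str` rather than the real `Attribute` type; only the shape of the wrapper is the point.

```rust
/// Hypothetical stand-in for a library method that still works but is
/// marked deprecated. It unescapes the five predefined XML entities.
#[deprecated(note = "illustrative stand-in, not a real quick-xml item")]
fn legacy_unescape(raw: &str) -> String {
    raw.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        // `&amp;` goes last so `&amp;lt;` becomes `&lt;`, not `<`.
        .replace("&amp;", "&")
}

/// The single place that is allowed to touch the deprecated call.
#[allow(deprecated)]
fn unescape_attr_value(raw: &str) -> String {
    legacy_unescape(raw)
}

/// A call site: it compiles without any deprecation warning because it
/// only ever sees the wrapper.
fn read_title_attr(raw: &str) -> String {
    unescape_attr_value(raw)
}
```

If the deprecated method is later removed, only the body of the wrapper needs to change.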
`office_oxide_cli` and `office_oxide_mcp` had `mod commands;` / `mod protocol;` as their first statement, leaving the crate root undocumented. Add a short crate-level `//!` doc and `#![warn(missing_docs)]` so future items in either binary stay documented. Verified: `RUSTDOCFLAGS="-D missing_docs" cargo doc --workspace --no-deps --features parallel,mmap` now passes with zero errors.
Pull request overview
A 0.1.2 release PR that bumps version metadata across all language bindings, upgrades quick-xml 0.37 → 0.40, and adds substantial round-trip-fidelity work for DOCX/PPTX/XLSX (embedded fonts, page setup, drawings, alignment, number formats, charts), plus IR enrichments (Shape, FramePosition, background_rgb).
Changes:
- Version bump to 0.1.2 across Cargo, Python, JS, C#, Go, WASM, and CHANGELOG.
- Major round-trip improvements: XLSX `pageSetup`/`pageMargins` parsing, drawings/pictures/text-shapes anchoring, custom `numfmt` rendering, structured chart text extraction; PPTX font sizes, colors, alignment, slide layouts, embedded fonts; DOCX framePr, multi-section sectPr, headers/footers, bottom-border ThematicBreak, embedded fonts.
- Migration from removed `quick-xml` 0.37 APIs (`unescape`/`unescape_value`) to centralized helpers in `core::xml`.
Reviewed changes
Copilot reviewed 54 out of 56 changed files in this pull request and generated 22 comments.
| File | Description |
|---|---|
| Cargo.toml / Cargo.lock | Workspace version bump to 0.1.2; quick-xml 0.37→0.40. |
| crates/office_oxide_{cli,mcp}/* | Version bumps + crate-level rustdoc with #![warn(missing_docs)]. |
| .github/workflows/*.yml | Pinned-action SHA bumps for upload-artifact, codeql, github-script, attest-sbom, rust-toolchain. |
| pyproject.toml, js/package*.json, wasm-pkg/package.json, csharp/.../OfficeOxide.csproj, go/cmd/install/main.go, bench_rust/Cargo.toml | Binding version bumps. |
| CHANGELOG.md | Adds 0.1.2 release notes (dated 2026-05-14). |
| src/xlsx/mod.rs | Adds chart text extraction, drawing anchor parsing, embedded fonts scanning, font/image bundling; some questionable parsing logic. |
| src/xlsx/worksheet.rs | Adds PageSetup, WorksheetPicture, WorksheetTextShape and related parsing/tests. |
| src/xlsx/styles.rs | Switches number_formats from Vec to HashMap for O(1) lookup. |
| src/xlsx/text.rs | Adds chart heading emission, single-column prose mode, applies numfmt to numeric cells; introduces write_cell_value_fast (heavy duplication). |
| src/xlsx/numfmt.rs | New module: built-in IDs + simplified custom-format parser; some edge-case rendering issues. |
| src/pptx/{shape,write,text,mod}.rs | Adds font-size/color on runs, paragraph alignment/space-before, picture data resolution, embedded fonts, real slide layouts, core-properties, ParaProps. |
| src/docx/{document,formatting,image,text,write}.rs | Adds FrameProps, ShapeInfo/AnchorPosition, multi-section sectPr writing, framePr parsing, header/footer markdown emission, embedded fonts + fontTable.xml, heading styles. |
| src/ir.rs, src/ir_render.rs, src/ir_from_markdown.rs | New Shape/ShapeGeom/FramePosition types, alignment/frame on Heading, background_rgb on Section, helpers first_inline_font_size_pt/inline_to_element_block/build_nested_list, centralized block defaults; tests added. |
| src/convert_{docx,pptx,xlsx,doc,ppt}.rs | Multi-section conversion, header/footer hoisting, positional TextBox wrapping, font/color/alignment propagation, removal of duplicated list builders. |
| src/core/{xml,units,relationships,properties,opc,mod}.rs | New unescape_text/unescape_attr_value helpers; HalfPoint::from_word_sz/from_drawingml_sz; new rel types DRAWING/FONT; register_default_content_type. |
| src/core/{embedded_fonts,core_properties}.rs | New shared modules for font embedding and docProps/core.xml generation. |
| tests/{office,write}_integration.rs | Tests updated for TextBox-wrapped PPTX shapes and Heading default fields. |
Files not reviewed (1)
- js/package-lock.json: Language not supported
Records the run-colour propagation folded into the release commit (DOCX `<w:rPr><w:color/>` and PPTX `<a:solidFill><a:srgbClr/>` into `TextSpan.color`), the quick-xml 0.37 → 0.40 API migration with the new `core::xml::unescape_text` / `unescape_attr_value` helpers, and the crate-level `//!` docs + `missing_docs` lint added to `office_oxide_cli` and `office_oxide_mcp`. Release date bumped to 2026-05-14.
The shape::TextRun struct gained a color_rgb field in this branch but ten in-test constructors in src/pptx/text.rs still listed the previous field set, breaking cargo clippy/test workspace-wide.
- xlsx/worksheet.rs: correct A3 paper-size twips (16838×23811 vs
the off-by-2 16840×23820); drop the no-op block that "zeroed"
already-zero dimensions in build_page_setup.
- docx/text.rs: remove the dead pre-split loop that built a string
it never appended anywhere; split_headers_footers does the
actual emission below.
- xlsx/mod.rs: drop the `push_str("")` no-op in extract_chart_text
— adjacent rich-text runs concatenate directly (the surrounding
XML preserves any intended whitespace as `<a:t xml:space="preserve">`).
- convert_xlsx.rs: when a worksheet had `<pageMargins>` but no
`<pageSetup>`, fall back to PageSetup::default() geometry instead
of dropping the parsed margins on the floor.
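
The corrected A3 figures follow directly from the unit conversion (1 inch = 25.4 mm = 1440 twips). A minimal sketch of that arithmetic, where `mm_to_twips` is a hypothetical helper name and rounding to the nearest twip is an assumption:

```rust
/// Twips per inch: 1 pt = 20 twips and 72 pt = 1 in.
const TWIPS_PER_INCH: f64 = 1440.0;
/// Millimetres per inch.
const MM_PER_INCH: f64 = 25.4;

/// Hypothetical helper: convert millimetres to twips, rounding to nearest.
fn mm_to_twips(mm: f64) -> u32 {
    (mm * TWIPS_PER_INCH / MM_PER_INCH).round() as u32
}
```

A3 is 297 mm × 420 mm, so `mm_to_twips(297.0)` gives 16838 and `mm_to_twips(420.0)` gives 23811, matching the values in the fix.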
Review fixes:
- xlsx/numfmt: rewrite format_commas to avoid the rounded.fract()
float round-trip (could off-by-one near .999…), fall back to the
bare Rust formatter when the value overflows u64, and surface
NaN/Infinity as visible labels instead of empty strings so anomalous
cells aren't mistaken for empty data.
- xlsx/numfmt: format_currency now puts the minus sign in front of the
symbol ("-$99.50" not "$-99.50"). Test updated to match.
- xlsx/worksheet: extract the ECMA-376 default margins into a single
PageMarginsIn::DEFAULTS constant and reuse it from parse_page_margins
and build_page_setup so future tweaks stay in lockstep.
- convert_pptx: use plain `h / 5` for hundredths-of-pt → twips. div_ceil
was inflating every non-multiple-of-5 by an extra twip.
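
A rough sketch of the numfmt behaviour described above, assuming signatures that are not taken from the crate: round once into an integer, split with integer division instead of `fract()`, fall back to the plain formatter on overflow, label NaN/Infinity, and put the minus sign before the currency symbol. `decimals` is assumed to be small (well below 20).

```rust
/// Insert thousands separators into a non-negative integer.
fn group_thousands(n: u64) -> String {
    let digits = n.to_string();
    let mut out = String::with_capacity(digits.len() + digits.len() / 3);
    for (i, ch) in digits.chars().enumerate() {
        if i > 0 && (digits.len() - i) % 3 == 0 {
            out.push(',');
        }
        out.push(ch);
    }
    out
}

/// Hypothetical `format_commas`: fixed decimals with thousands separators.
fn format_commas(value: f64, decimals: u32) -> String {
    if value.is_nan() {
        return "NaN".to_string();
    }
    if value.is_infinite() {
        let label = if value > 0.0 { "Infinity" } else { "-Infinity" };
        return label.to_string();
    }
    let scale = 10u64.pow(decimals);
    // Round exactly once, on the scaled magnitude.
    let scaled = (value.abs() * scale as f64).round();
    if scaled >= u64::MAX as f64 {
        // Too large for integer splitting: use the bare formatter.
        return format!("{:.*}", decimals as usize, value);
    }
    let n = scaled as u64;
    // Suppress the sign when the value rounds to zero.
    let sign = if value < 0.0 && n != 0 { "-" } else { "" };
    let int_part = group_thousands(n / scale);
    if decimals == 0 {
        return format!("{sign}{int_part}");
    }
    let frac = format!("{:0width$}", n % scale, width = decimals as usize);
    format!("{sign}{int_part}.{frac}")
}

/// Hypothetical `format_currency`: the minus sign precedes the symbol.
fn format_currency(value: f64, symbol: &str, decimals: u32) -> String {
    let formatted = format_commas(value, decimals);
    if !value.is_finite() {
        return formatted;
    }
    match formatted.strip_prefix('-') {
        Some(rest) => format!("-{symbol}{rest}"),
        None => format!("{symbol}{formatted}"),
    }
}
```

Because the carry happens in the integer domain, a value such as 0.9996 at three decimals becomes `1.000` rather than risking a mismatched integer and fractional part.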
Coverage:
- New unit tests for src/xls/images.rs (was 0%) covering BLIP type
detection, UID/header sizing, signature validation, format mapping,
and end-to-end record extraction with a synthetic PNG payload.
- New unit tests for src/xlsx/mod.rs (was ~16%) covering sheet-rels
path derivation, relative ZIP path resolution (absolute, .. and ./
segments), image-format byte sniffing, extract_chart_text on a
minimal title plus a categories/series example, and the drawing
anchor parser on picture/text/empty inputs.
- Apply rustfmt across recent edits (the v0.1.2 PR's Lint and Format Check job was failing because my recent commits hand-wrote a few lines that exceeded rustfmt's max_width).
- xlsx/numfmt: only treat 'E'/'e' as a scientific-notation marker when followed by '+' or '-'. A bare 'E' in a custom format like "000E" was previously consuming the next character unconditionally, which could swallow a literal or a digit it should have kept.
- xlsx/mod: in parse_drawing_anchors, restrict the `<off>` fallback to the outermost anchor scope (AnchorKind::Unknown). Otherwise the `<a:off>` inside a shape's `<a:xfrm>` would overwrite x/y coords parsed earlier from `<xdr:pos>` in an absoluteAnchor.
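
The 'E'/'e' rule can be sketched as a one-character lookahead. The `Token` enum and `tokenize` function below are hypothetical and much simpler than a real format parser; they only show that the character after 'E' is consumed when, and only when, it is a sign.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    /// A digit placeholder: `0`, `#`, or `?`.
    Digit(char),
    /// A scientific-notation marker such as `E+` or `e-`.
    Exponent { sign: char },
    /// Anything else, emitted verbatim.
    Literal(char),
}

/// Hypothetical tokenizer for a fragment of a custom number format.
fn tokenize(format: &str) -> Vec<Token> {
    let chars: Vec<char> = format.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match c {
            '0' | '#' | '?' => tokens.push(Token::Digit(c)),
            'E' | 'e' => match chars.get(i + 1) {
                // Only a following sign makes this an exponent marker,
                // and only then is the next character consumed.
                Some(&sign) if sign == '+' || sign == '-' => {
                    tokens.push(Token::Exponent { sign });
                    i += 1;
                }
                // A bare `E` is a literal; the next character is kept.
                _ => tokens.push(Token::Literal(c)),
            },
            _ => tokens.push(Token::Literal(c)),
        }
        i += 1;
    }
    tokens
}
```

With this rule, `"000E"` ends in a literal `E`, and in `"0E0"` the trailing `0` survives as a digit placeholder.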
…g logs

- pptx::write::embed_font: spell out that deduplication is by name only, not by bytes; document the workaround of using distinct family names for multiple faces.
- docx::write::generate_font_table_xml: note that every entry is emitted as <w:embedRegular> regardless of the underlying style; document the recommended workaround (separate family names per face).
- xlsx::read_drawing_for_sheet: emit a `debug!` line when a drawing part fails to read or parse, instead of silently swallowing the error. Lets operators trace cases where worksheet drawings vanish.
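
To make the "deduplication is by name only" caveat concrete, here is a minimal sketch. `FontTable` and this `embed_font` signature are hypothetical and are not the writer's real API; the behaviour of ignoring a second font with the same family name is assumed from the note above.

```rust
use std::collections::HashMap;

/// Hypothetical font registry keyed by family name.
#[derive(Default)]
struct FontTable {
    fonts: HashMap<String, Vec<u8>>,
}

impl FontTable {
    /// Returns `true` if the font was stored, `false` if a font with the
    /// same family name already exists. The bytes are never compared.
    fn embed_font(&mut self, family: &str, bytes: Vec<u8>) -> bool {
        if self.fonts.contains_key(family) {
            return false;
        }
        self.fonts.insert(family.to_string(), bytes);
        true
    }
}
```

Under this model a bold face registered as `"Inter"` after the regular face is dropped, while registering it as `"Inter Bold"` keeps both, which is the documented workaround.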
Previously, `split_headers_footers` derived role by comparing each entry's index against the cumulative count of all sections' `header_refs`. That assumed `headers_footers` was laid out as "all headers first, then all footers" — but the parser actually interleaves them per section (header_refs of section 0, then footer_refs of section 0, then headers of section 1, etc.). In multi-section documents the cumulative-count split silently misclassified entries into the wrong column. Record the role explicitly on each parsed `HeaderFooter` and let the markdown renderer read it directly. Walking header_refs and footer_refs in two separate loops at parse time keeps the role authoritative, even when individual refs fail to resolve and don't contribute an entry to `headers_footers`. Closes a Copilot review comment on PR #38.
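
A minimal sketch of the fix, with simplified hypothetical types (`Role`, `Section`, and a `resolve` stub that fails for ids starting with `missing`); the real parser's types and resolution logic will differ.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    Header,
    Footer,
}

#[derive(Debug)]
struct HeaderFooter {
    role: Role,
    text: String,
}

struct Section {
    header_refs: Vec<&'static str>,
    footer_refs: Vec<&'static str>,
}

/// Hypothetical resolver: ids starting with "missing" fail to resolve.
fn resolve(rel_id: &str) -> Option<String> {
    if rel_id.starts_with("missing") {
        None
    } else {
        Some(format!("text of {rel_id}"))
    }
}

/// Entries are interleaved per section, so the role is recorded at parse
/// time, in two separate loops, instead of being inferred from position.
fn parse_headers_footers(sections: &[Section]) -> Vec<HeaderFooter> {
    let mut out = Vec::new();
    for section in sections {
        for id in &section.header_refs {
            if let Some(text) = resolve(id) {
                out.push(HeaderFooter { role: Role::Header, text });
            }
        }
        for id in &section.footer_refs {
            if let Some(text) = resolve(id) {
                out.push(HeaderFooter { role: Role::Footer, text });
            }
        }
    }
    out
}

/// The renderer reads the stored role directly.
fn split_headers_footers(entries: &[HeaderFooter]) -> (Vec<&str>, Vec<&str>) {
    let mut headers = Vec::new();
    let mut footers = Vec::new();
    for entry in entries {
        match entry.role {
            Role::Header => headers.push(entry.text.as_str()),
            Role::Footer => footers.push(entry.text.as_str()),
        }
    }
    (headers, footers)
}
```

For two sections with refs `[h0] / [f0]` and `[missing-h1, h1b] / [f1]`, the parsed order is h0, f0, h1b, f1. A cumulative header-ref count of 3 would have labelled the first three entries as headers, including f0; reading the stored role classifies all four correctly.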
…ants

Coverage was sitting at 73.2% line on this branch, below the 75% floor enforced by the Code Coverage CI job. The PR introduced a lot of new write/conversion code (chart text, embedded fonts, multi-section sectPr, drawing anchors, page setup) and the existing integration tests only exercised the common element variants.

Adds round-trip tests through `create_from_ir_to_writer` for:
- ThematicBreak (verifies w:pBdr emission in document.xml)
- PageBreak + ColumnBreak (verifies w:br w:type="page"/"column")
- Footnote + Endnote (verifies the footnotes.xml/endnotes.xml parts)
- TextBox (verifies floating content lands in document.xml)
- Numbered List with start_number
- Multi-section document with Continuous / NextPage / OddPage breaks
- Crate-level Document::from_reader + plain_text + to_markdown + to_ir convenience path

Local line coverage rises from 73.21% to 76.43%, clearing the 75% threshold.