
Release/v0.1.2 #38

Merged
yfedoseev merged 18 commits into main from release/v0.1.2
May 15, 2026

@yfedoseev

Owner

Description

Closes #

Type of change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation / tooling / CI only

Checklist

  • cargo fmt -- --check passes
  • cargo clippy --all-targets -- -D warnings passes
  • Tests added or updated for the changed behaviour
  • cargo test passes locally
  • Documentation updated (public API changes have rustdoc, README updated if needed)
  • CHANGELOG.md updated under [Unreleased]
  • All commits include a Signed-off-by trailer (git commit -s)

Testing

Notes for reviewers

yfedoseev and others added 10 commits May 13, 2026 18:36
Major work for the v0.1.2 release branch. Consolidates 32 commits
of round-trip fidelity fixes, IR enrichment, performance, and
test coverage into a single release commit.

# Performance
- xlsx: O(1) cell-style lookups via HashMap (replaces linear Vec scan
  in the hot per-cell formatting path)
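
A minimal sketch of the lookup change described above, assuming a hypothetical `NumFmt` record (the crate's real style types are not shown in this PR text). It contrasts a per-cell linear scan with an index that is built once and queried in O(1) on average:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed style entry; not the crate's real type.
struct NumFmt {
    id: u32,
    code: String,
}

/// Linear scan over the Vec: O(n) for every cell that needs a format.
fn find_linear(formats: &[NumFmt], id: u32) -> Option<&str> {
    formats.iter().find(|f| f.id == id).map(|f| f.code.as_str())
}

/// Build the index once; each later lookup is O(1) on average.
fn build_index(formats: &[NumFmt]) -> HashMap<u32, &str> {
    formats.iter().map(|f| (f.id, f.code.as_str())).collect()
}

fn main() {
    let formats = vec![
        NumFmt { id: 164, code: "0.000".to_string() },
        NumFmt { id: 165, code: "#,##0.00".to_string() },
    ];
    let index = build_index(&formats);

    // Both strategies agree; only the per-lookup cost differs.
    assert_eq!(find_linear(&formats, 165), Some("#,##0.00"));
    assert_eq!(index.get(&165).copied(), Some("#,##0.00"));
    assert_eq!(index.get(&999).copied(), None);
}
```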

# Round-trip fidelity (PDF → office → PDF)
- DOCX/PPTX/XLSX: preserve images, fonts, columns end-to-end
- Alignment, spacing, footers, rules survive both directions
- PPTX ThematicBreak encoded as a 30-char U+2500 marker run that
  downstream PDF renderers detect and re-emit as a real <hr>
- DOCX <w:pBdr><w:bottom/> on an empty paragraph recovers as
  Element::ThematicBreak in IR
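
The PPTX marker-run encoding above could look roughly like the following. The function names and the exact detection rule (exactly 30 characters, all U+2500) are assumptions made for illustration; the PR text only states that the marker is a 30-char U+2500 run.

```rust
/// U+2500 BOX DRAWINGS LIGHT HORIZONTAL, the marker character named in the PR.
const RULE_CHAR: char = '\u{2500}';
/// Marker length named in the PR.
const RULE_LEN: usize = 30;

/// Produce the text of the marker run (hypothetical helper name).
fn thematic_break_marker() -> String {
    std::iter::repeat(RULE_CHAR).take(RULE_LEN).collect()
}

/// Assumed detection rule: exactly RULE_LEN characters, all of them RULE_CHAR.
fn is_thematic_break_marker(run_text: &str) -> bool {
    run_text.chars().count() == RULE_LEN && run_text.chars().all(|c| c == RULE_CHAR)
}

fn main() {
    let marker = thematic_break_marker();
    assert!(is_thematic_break_marker(&marker));
    // Ordinary text and shorter runs are not treated as a rule.
    assert!(!is_thematic_break_marker("Quarterly results"));
    assert!(!is_thematic_break_marker("\u{2500}\u{2500}\u{2500}"));
}
```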

# DOCX
- Parse <w:framePr> into IR FramePosition (layout-preserving paths
  like pdf_oxide's to_docx_bytes_layout)
- Heading carries frame_position
- Parse floating <wp:anchor> drawings and <wps:wsp> vector shapes
  (line/rect with stroke/fill RGB)
- Preserve per-section page sizes; emit per-section <w:sectPr>
  on multi-section IR
- Preserve <w:sz> through to IR's font_size_half_pt
- Include header/footer text in to_markdown and to_ir
- Embedded fonts under /word/fonts/ are parsed and exposed on
  DocxDocument.embedded_fonts; strip_embedded_font_filename
  recovers the original face name from font_<n>_<face>.<ext>
  (fixes greedy alphabetic-trim regression)
- parse_drawing decomposed into focused recursive helpers
- Plumb paragraph alignment + inline image collection
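
For the `font_<n>_<face>.<ext>` naming scheme mentioned above, one way to recover the face name without trimming alphabetic characters is to split on the known separators. This is a sketch under that assumption; `face_name_from_font_file` is a hypothetical name and is not the crate's `strip_embedded_font_filename`.

```rust
/// Recover `<face>` from `font_<n>_<face>.<ext>`; returns None for other shapes.
/// Hypothetical re-implementation for illustration only.
fn face_name_from_font_file(file_name: &str) -> Option<&str> {
    // Drop the extension, if any (split at the last dot).
    let stem = match file_name.rsplit_once('.') {
        Some((stem, _ext)) => stem,
        None => file_name,
    };
    // Require the literal prefix, then a numeric index, then the face name.
    let rest = stem.strip_prefix("font_")?;
    let (index, face) = rest.split_once('_')?;
    if index.is_empty() || !index.chars().all(|c| c.is_ascii_digit()) || face.is_empty() {
        return None;
    }
    Some(face)
}

fn main() {
    assert_eq!(face_name_from_font_file("font_1_Arial.ttf"), Some("Arial"));
    // Underscores inside the face name survive because only the first one splits.
    assert_eq!(face_name_from_font_file("font_12_Open_Sans.odttf"), Some("Open_Sans"));
    assert_eq!(face_name_from_font_file("logo.png"), None);
}
```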

# PPTX
- Real Title+Body slide layout instead of blank
- Paginate slides (~250 cap) + synthesize Slide N heading on to_ir
- Wrap shapes in positioned TextBox + parse slide background
- Don't wrap zero-size shape positions in TextBox
- Propagate slide size to per-section page_setup
- Preserve run font sizes (sz attribute → font_size_hundredths_pt)
- Parse paragraph algn + spcBef → IR alignment + space_before
- Picture shapes carry embed_rid + bytes + format resolved via
  pre-built media map
- PPTX font embedding under /ppt/fonts/
- Structured chart text extraction (parses <c:chart> nodes into
  per-chart text blocks rendered as ## Chart N in markdown)

# XLSX
- Per-worksheet page_setup round-trip via <pageMargins>/<pageSetup>
  with inch/mm/cm/paperSize parsing
- Preserve font sizes through IR; emit prose XLSX as paragraphs
  when a 1-column sheet has long-text cells
- Unique worksheet names in ir_to_xlsx
- New numfmt module: built-in IDs 0-44 (general, fixed, commas,
  percent, currency, scientific, accounting) + custom format
  strings (multi-section, [Red] color directive, currency prefix,
  quoted literal suffix, scale-by-thousand)
- Worksheet drawings: WorksheetPicture + WorksheetTextShape parsed
  from xl/drawings/, anchor coords in EMU
- Embedded fonts under /xl/fonts/
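
As a rough illustration of the built-in ID handling in the numfmt bullet above, the sketch below maps a few of the ECMA-376 built-in IDs to their format codes and applies thousands grouping with integer arithmetic only. The function names are hypothetical and the table is deliberately partial; the module's actual structure is not shown in this PR text.

```rust
/// A few ECMA-376 built-in number-format IDs (partial table, for illustration).
fn builtin_format_code(id: u32) -> Option<&'static str> {
    match id {
        0 => Some("General"),
        1 => Some("0"),
        2 => Some("0.00"),
        3 => Some("#,##0"),
        4 => Some("#,##0.00"),
        9 => Some("0%"),
        10 => Some("0.00%"),
        11 => Some("0.00E+00"),
        _ => None,
    }
}

/// Insert a comma before every group of three digits, counted from the right.
fn group_thousands(n: u64) -> String {
    let digits = n.to_string();
    let mut out = String::with_capacity(digits.len() + digits.len() / 3);
    for (i, ch) in digits.chars().enumerate() {
        if i > 0 && (digits.len() - i) % 3 == 0 {
            out.push(',');
        }
        out.push(ch);
    }
    out
}

fn main() {
    assert_eq!(builtin_format_code(3), Some("#,##0"));
    assert_eq!(group_thousands(1_234_567), "1,234,567");
    assert_eq!(group_thousands(999), "999");
}
```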

# IR enrichment
- New types: Shape, ShapeGeom, FramePosition, ParagraphAlignment
  variants (Distribute), block_default centralisation (ThematicBreak
  → "---" / "<hr />", PageBreak/Shape invisible in flow, TextBox
  recursively renders children)
- New helpers: first_inline_font_size_pt, inline_to_element_block,
  build_nested_list (flat / 2-level / 3-level nesting)
- Heading carries frame_position + alignment
- Section.background_rgb propagated from PPTX slide background
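
The centralised block defaults listed above can be pictured as one exhaustive `match`. The `Element` enum here is a cut-down, hypothetical version of the IR type, and `render_markdown` is an assumed name; the point is that leaving out a wildcard arm makes the compiler reject any new variant until it is given a default.

```rust
/// Cut-down, hypothetical version of the IR element type.
enum Element {
    Paragraph(String),
    ThematicBreak,
    PageBreak,
    Shape,
    TextBox(Vec<Element>),
}

/// Markdown defaults in one place. No `_` arm: adding a variant to
/// `Element` is a compile error here until it is handled.
fn render_markdown(el: &Element) -> String {
    match el {
        Element::Paragraph(text) => text.clone(),
        Element::ThematicBreak => "---".to_string(),
        // Invisible in flow: contribute nothing to the output.
        Element::PageBreak | Element::Shape => String::new(),
        // Render children recursively, skipping the invisible ones.
        Element::TextBox(children) => children
            .iter()
            .map(render_markdown)
            .filter(|s| !s.is_empty())
            .collect::<Vec<_>>()
            .join("\n\n"),
    }
}

fn main() {
    let boxed = Element::TextBox(vec![
        Element::Paragraph("a".to_string()),
        Element::PageBreak,
        Element::Shape,
        Element::Paragraph("b".to_string()),
    ]);
    assert_eq!(render_markdown(&boxed), "a\n\nb");
    assert_eq!(render_markdown(&Element::ThematicBreak), "---");
}
```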

# Writers
- DOCX: wire fontTable, heading styles, embed fonts, core props,
  dedup runs
- PPTX: cap slides at ~250 (PowerPoint hard limit), autoFit, set_title_aligned
- XLSX: split long paragraphs across cells; unique sheet names

# Refactors
- core: unified font embedding helper + cross-format font-size
  invariant (HalfPoint::from_word_sz / from_drawingml_sz)
- ir: consolidate inline_to_element / build_nested_list /
  first_inline_font_size_pt (used by all 3 IR converters)
- ir_render: extract block_default to centralise no-flow defaults
  (compiler-enforced exhaustiveness on new Element variants)
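
The cross-format font-size invariant rests on two unit conventions: WordprocessingML `w:sz` is already in half-points, while DrawingML `sz` is in hundredths of a point. A minimal sketch of a `HalfPoint` newtype built on those facts follows; the constructor names match the bullet above, but the bodies, the rounding choice, and the `points` accessor are assumptions.

```rust
/// Font size stored in half-points (sketch; the real type is not shown in the PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct HalfPoint(u32);

impl HalfPoint {
    /// `w:sz` is already expressed in half-points.
    fn from_word_sz(sz: u32) -> Self {
        HalfPoint(sz)
    }

    /// DrawingML `sz` is hundredths of a point, so 50 units make one half-point.
    /// Rounding to nearest is an assumption of this sketch.
    fn from_drawingml_sz(sz: u32) -> Self {
        HalfPoint((sz + 25) / 50)
    }

    /// Size in points (assumed convenience accessor).
    fn points(self) -> f64 {
        f64::from(self.0) / 2.0
    }
}

fn main() {
    // 12 pt text is sz="24" in DOCX and sz="1200" in PPTX.
    let docx = HalfPoint::from_word_sz(24);
    let pptx = HalfPoint::from_drawingml_sz(1200);
    assert_eq!(docx, pptx);
    assert_eq!(docx.points(), 12.0);
}
```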

# Tests
- 98 new unit tests across the touched modules (core,
  xlsx/numfmt, xlsx/worksheet, docx/formatting, docx/mod,
  pptx/slide, ir, ir_render). All in-module #[cfg(test)] blocks;
  no new integration files.
- Final state: 535/535 tests pass on default, --features parallel,
  --features mmap, --features parallel+mmap

# Cleanup
- cargo fmt clean
- cargo clippy --workspace --all-targets -- -D warnings clean
- 0 build warnings
- maturin build (python feature) and wasm-pack build (wasm
  feature) both produce working packages; Python smoke verifies
  Document / EditableDocument / XlsxWriter / PptxWriter /
  create_from_markdown all functional

Bumps [actions/attest-sbom](https://github.com/actions/attest-sbom) from 2.4.0 to 4.1.0.
- [Release notes](https://github.com/actions/attest-sbom/releases)
- [Changelog](https://github.com/actions/attest-sbom/blob/main/RELEASE.md)
- [Commits](actions/attest-sbom@bd218ad...c604332)

---
updated-dependencies:
- dependency-name: actions/attest-sbom
  dependency-version: 4.1.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.35.2 to 4.35.3.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@ce64ddc...e46ed2c)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.35.3
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.6.2 to 7.0.1.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@ea165f8...043fb46)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: 7.0.1
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
…48a0b548edc03f92a220660cdb8

Updates the requirements on [dtolnay/rust-toolchain](https://github.com/dtolnay/rust-toolchain) to permit the latest version.
- [Release notes](https://github.com/dtolnay/rust-toolchain/releases)
- [Commits](https://github.com/dtolnay/rust-toolchain/commits/29eef336d9b2848a0b548edc03f92a220660cdb8)

---
updated-dependencies:
- dependency-name: dtolnay/rust-toolchain
  dependency-version: 29eef336d9b2848a0b548edc03f92a220660cdb8
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [actions/github-script](https://github.com/actions/github-script) from 7.0.1 to 9.0.0.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@60a0d83...3a2844b)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: 9.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [koffi](https://github.com/Koromix/koffi) from 2.16.1 to 2.16.2.
- [Commits](https://github.com/Koromix/koffi/commits)

---
updated-dependencies:
- dependency-name: koffi
  dependency-version: 2.16.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [quick-xml](https://github.com/tafia/quick-xml) from 0.37.5 to 0.40.0.
- [Release notes](https://github.com/tafia/quick-xml/releases)
- [Changelog](https://github.com/tafia/quick-xml/blob/master/Changelog.md)
- [Commits](tafia/quick-xml@v0.37.5...v0.40.0)

---
updated-dependencies:
- dependency-name: quick-xml
  dependency-version: 0.40.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Migrate the parsers to quick-xml 0.40 after the dependabot cherry-pick:

- `BytesText::unescape()` was removed in 0.40. Replace 6 call sites
  with new `core::xml::unescape_text(BytesText) -> Result<String>`
  helper that does `decode()?` + `escape::unescape()?` in one call.
- `Attribute::unescape_value()` is deprecated in 0.40 (replacement
  `normalized_value()` has different semantics — no entity unescaping).
  Wrap the 6 call sites through new `core::xml::unescape_attr_value`
  helper with `#[allow(deprecated)]` localised to one place so the
  call sites stay deprecation-free.
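
The "localise `#[allow(deprecated)]` in one wrapper" pattern from the second bullet can be shown without quick-xml itself. In the sketch below, `legacy::unescape_value` is a stand-in for a deprecated upstream method (it is not the quick-xml API and handles only four entities), and `unescape_attr_value` is the single place where the lint is silenced.

```rust
mod legacy {
    /// Stand-in for a deprecated upstream method; NOT the quick-xml API.
    #[deprecated(note = "stand-in for a deprecated upstream method")]
    pub fn unescape_value(raw: &str) -> String {
        // `&amp;` is replaced last so `&amp;lt;` becomes `&lt;`, not `<`.
        raw.replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&quot;", "\"")
            .replace("&amp;", "&")
    }
}

/// The only function allowed to touch the deprecated call.
/// Callers use this wrapper and stay free of deprecation warnings.
#[allow(deprecated)]
fn unescape_attr_value(raw: &str) -> String {
    legacy::unescape_value(raw)
}

fn main() {
    assert_eq!(unescape_attr_value("a &amp; b &lt;c&gt;"), "a & b <c>");
    assert_eq!(unescape_attr_value("&amp;lt;"), "&lt;");
}
```

If the upstream method is later removed outright, only the wrapper body needs to change.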

Also apply `cargo fmt --all` (4 files: convert_docx, convert_xlsx,
create, xlsx/text — pre-existing fmt drift surfaced by rebuild).

Result: 0 warnings, cargo clippy --workspace --all-targets
-- -D warnings clean, 535/535 tests pass.

`office_oxide_cli` and `office_oxide_mcp` had `mod commands;` / `mod
protocol;` as their first statement, leaving the crate root undocumented.
Add a short crate-level `//!` doc and `#![warn(missing_docs)]` so future
items in either binary stay documented.

Verified: `RUSTDOCFLAGS="-D missing_docs" cargo doc --workspace
--no-deps --features parallel,mmap` now passes with zero errors.

Copilot AI left a comment

Pull request overview

A 0.1.2 release PR that bumps version metadata across all language bindings, upgrades quick-xml 0.37 → 0.40, and adds substantial round-trip-fidelity work for DOCX/PPTX/XLSX (embedded fonts, page setup, drawings, alignment, number formats, charts), plus IR enrichments (Shape, FramePosition, background_rgb).

Changes:

  • Version bump to 0.1.2 across Cargo, Python, JS, C#, Go, WASM, and CHANGELOG.
  • Major round-trip improvements: XLSX pageSetup/pageMargins parsing, drawings/pictures/text-shapes anchoring, custom numfmt rendering, structured chart text extraction; PPTX font sizes, colors, alignment, slide layouts, embedded fonts; DOCX framePr, multi-section sectPr, headers/footers, bottom-border ThematicBreak, embedded fonts.
  • Migration off the quick-xml 0.37 APIs that 0.40 removes or deprecates (unescape removed, unescape_value deprecated) to centralized helpers in core::xml.

Reviewed changes

Copilot reviewed 54 out of 56 changed files in this pull request and generated 22 comments.

| File | Description |
| --- | --- |
| Cargo.toml / Cargo.lock | Workspace version bump to 0.1.2; quick-xml 0.37→0.40. |
| crates/office_oxide_{cli,mcp}/* | Version bumps + crate-level rustdoc with #![warn(missing_docs)]. |
| .github/workflows/*.yml | Pinned-action SHA bumps for upload-artifact, codeql, github-script, attest-sbom, rust-toolchain. |
| pyproject.toml, js/package*.json, wasm-pkg/package.json, csharp/.../OfficeOxide.csproj, go/cmd/install/main.go, bench_rust/Cargo.toml | Binding version bumps. |
| CHANGELOG.md | Adds 0.1.2 release notes (dated 2026-05-14). |
| src/xlsx/mod.rs | Adds chart text extraction, drawing anchor parsing, embedded fonts scanning, font/image bundling; some questionable parsing logic. |
| src/xlsx/worksheet.rs | Adds PageSetup, WorksheetPicture, WorksheetTextShape and related parsing/tests. |
| src/xlsx/styles.rs | Switches number_formats from Vec to HashMap for O(1) lookup. |
| src/xlsx/text.rs | Adds chart heading emission, single-column prose mode, applies numfmt to numeric cells; introduces write_cell_value_fast (heavy duplication). |
| src/xlsx/numfmt.rs | New module: built-in IDs + simplified custom-format parser; some edge-case rendering issues. |
| src/pptx/{shape,write,text,mod}.rs | Adds font-size/color on runs, paragraph alignment/space-before, picture data resolution, embedded fonts, real slide layouts, core-properties, ParaProps. |
| src/docx/{document,formatting,image,text,write}.rs | Adds FrameProps, ShapeInfo/AnchorPosition, multi-section sectPr writing, framePr parsing, header/footer markdown emission, embedded fonts + fontTable.xml, heading styles. |
| src/ir.rs, src/ir_render.rs, src/ir_from_markdown.rs | New Shape/ShapeGeom/FramePosition types, alignment/frame on Heading, background_rgb on Section, helpers first_inline_font_size_pt/inline_to_element_block/build_nested_list, centralized block defaults; tests added. |
| src/convert_{docx,pptx,xlsx,doc,ppt}.rs | Multi-section conversion, header/footer hoisting, positional TextBox wrapping, font/color/alignment propagation, removal of duplicated list builders. |
| src/core/{xml,units,relationships,properties,opc,mod}.rs | New unescape_text/unescape_attr_value helpers; HalfPoint::from_word_sz/from_drawingml_sz; new rel types DRAWING/FONT; register_default_content_type. |
| src/core/{embedded_fonts,core_properties}.rs | New shared modules for font embedding and docProps/core.xml generation. |
| tests/{office,write}_integration.rs | Tests updated for TextBox-wrapped PPTX shapes and Heading default fields. |

Files not reviewed (1)
  • js/package-lock.json: Language not supported

Comment thread src/xlsx/mod.rs Outdated
Comment thread src/docx/text.rs Outdated
Comment thread src/convert_xlsx.rs Outdated
Comment thread src/xlsx/worksheet.rs Outdated
Comment thread src/xlsx/numfmt.rs Outdated
Comment thread src/docx/write.rs
Comment thread src/convert_pptx.rs Outdated
Comment thread src/xlsx/mod.rs
Comment thread src/xlsx/worksheet.rs
Comment thread src/xlsx/text.rs
yfedoseev added 8 commits May 14, 2026 17:34
Records the run-colour propagation folded into the release commit
(DOCX `<w:rPr><w:color/>` and PPTX `<a:solidFill><a:srgbClr/>` into
`TextSpan.color`), the quick-xml 0.37 → 0.40 API migration with the
new `core::xml::unescape_text` / `unescape_attr_value` helpers, and
the crate-level `//!` docs + `missing_docs` lint added to
`office_oxide_cli` and `office_oxide_mcp`. Release date bumped to
2026-05-14.

The shape::TextRun struct gained a color_rgb field in this branch but
ten in-test constructors in src/pptx/text.rs still listed the previous
field set, breaking cargo clippy/test workspace-wide.

- xlsx/worksheet.rs: correct A3 paper-size twips (16838×23811 vs
  the off-by-2 16840×23820); drop the no-op block that "zeroed"
  already-zero dimensions in build_page_setup.
- docx/text.rs: remove the dead pre-split loop that built a string
  it never appended anywhere; split_headers_footers does the
  actual emission below.
- xlsx/mod.rs: drop the `push_str("")` no-op in extract_chart_text
  — adjacent rich-text runs concatenate directly (the surrounding
  XML preserves any intended whitespace as `<a:t xml:space="preserve">`).
- convert_xlsx.rs: when a worksheet had `<pageMargins>` but no
  `<pageSetup>`, fall back to PageSetup::default() geometry instead
  of dropping the parsed margins on the floor.
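
The corrected A3 figures in the first bullet follow from the ISO dimensions (297 mm × 420 mm) at 1440 twips per inch and 25.4 mm per inch. A small check, with `mm_to_twips` as a hypothetical helper name:

```rust
/// Convert millimetres to twips (1 inch = 25.4 mm = 1440 twips),
/// rounding to the nearest twip. Hypothetical helper for illustration.
fn mm_to_twips(mm: f64) -> u32 {
    (mm / 25.4 * 1440.0).round() as u32
}

fn main() {
    // ISO A3 is 297 mm x 420 mm.
    assert_eq!(mm_to_twips(297.0), 16838);
    assert_eq!(mm_to_twips(420.0), 23811);
    // ISO A4 (210 mm x 297 mm) shares the 16838 edge.
    assert_eq!(mm_to_twips(210.0), 11906);
}
```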
Review fixes:
- xlsx/numfmt: rewrite format_commas to avoid the rounded.fract()
  float round-trip (could off-by-one near .999…), fall back to the
  bare Rust formatter when the value overflows u64, and surface
  NaN/Infinity as visible labels instead of empty strings so anomalous
  cells aren't mistaken for empty data.
- xlsx/numfmt: format_currency now puts the minus sign in front of the
  symbol ("-$99.50" not "$-99.50"). Test updated to match.
- xlsx/worksheet: extract the ECMA-376 default margins into a single
  PageMarginsIn::DEFAULTS constant and reuse it from parse_page_margins
  and build_page_setup so future tweaks stay in lockstep.
- convert_pptx: use plain `h / 5` for hundredths-of-pt → twips. div_ceil
  was inflating every non-multiple-of-5 by an extra twip.
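
The arithmetic behind the convert_pptx fix: one point is 20 twips, so a value in hundredths of a point converts as `h * 20 / 100 = h / 5`. The sketch below compares truncating division with a ceiling division written out by hand; both function names are hypothetical.

```rust
/// Hundredths of a point -> twips with truncating division (the fixed behaviour).
fn hundredths_pt_to_twips(h: u32) -> u32 {
    h / 5
}

/// Ceiling variant, written out by hand, to show the inflation described above.
fn hundredths_pt_to_twips_ceil(h: u32) -> u32 {
    (h + 4) / 5
}

fn main() {
    // Exact multiples of 5 agree: 12.00 pt is 240 twips either way.
    assert_eq!(hundredths_pt_to_twips(1200), 240);
    assert_eq!(hundredths_pt_to_twips_ceil(1200), 240);
    // Any other value gains one extra twip under ceiling division.
    assert_eq!(hundredths_pt_to_twips(1201), 240);
    assert_eq!(hundredths_pt_to_twips_ceil(1201), 241);
}
```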

Coverage:
- New unit tests for src/xls/images.rs (was 0%) covering BLIP type
  detection, UID/header sizing, signature validation, format mapping,
  and end-to-end record extraction with a synthetic PNG payload.
- New unit tests for src/xlsx/mod.rs (was ~16%) covering sheet-rels
  path derivation, relative ZIP path resolution (absolute, .. and ./
  segments), image-format byte sniffing, extract_chart_text on a
  minimal title plus a categories/series example, and the drawing
  anchor parser on picture/text/empty inputs.
- Apply rustfmt across recent edits (the v0.1.2 PR's Lint and Format
  Check job was failing because my recent commits hand-wrote a few
  lines that exceeded rustfmt's max_width).

- xlsx/numfmt: only treat 'E'/'e' as a scientific-notation marker when
  followed by '+' or '-'. A bare 'E' in a custom format like "000E"
  was previously consuming the next character unconditionally, which
  could swallow a literal or a digit it should have kept.
- xlsx/mod: in parse_drawing_anchors, restrict the `<off>` fallback to
  the outermost anchor scope (AnchorKind::Unknown). Otherwise the
  `<a:off>` inside a shape's `<a:xfrm>` would overwrite x/y coords
  parsed earlier from `<xdr:pos>` in an absoluteAnchor.
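
The tightened scientific-notation check in the numfmt bullet can be sketched as a lookahead on the next character. `is_scientific_marker` is a hypothetical name, and the format string is assumed to be pre-split into `char`s.

```rust
/// True only when position `i` holds 'E' or 'e' AND the next character
/// is '+' or '-'. A bare trailing 'E' is left for the caller to treat
/// as a literal.
fn is_scientific_marker(format: &[char], i: usize) -> bool {
    let is_e = matches!(format.get(i).copied(), Some('E') | Some('e'));
    let has_sign = matches!(format.get(i + 1).copied(), Some('+') | Some('-'));
    is_e && has_sign
}

fn main() {
    let sci: Vec<char> = "0.00E+00".chars().collect();
    let bare: Vec<char> = "000E".chars().collect();
    assert!(is_scientific_marker(&sci, 4));
    // The trailing 'E' has nothing after it, so it is not a marker.
    assert!(!is_scientific_marker(&bare, 3));
}
```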
…g logs

- pptx::write::embed_font: spell out that deduplication is by name
  only, not by bytes; document the workaround of using distinct
  family names for multiple faces.
- docx::write::generate_font_table_xml: note that every entry is
  emitted as <w:embedRegular> regardless of the underlying style;
  document the recommended workaround (separate family names per
  face).
- xlsx::read_drawing_for_sheet: emit a `debug!` line when a drawing
  part fails to read or parse, instead of silently swallowing the
  error. Lets operators trace cases where worksheet drawings vanish.

Previously, `split_headers_footers` derived role by comparing each
entry's index against the cumulative count of all sections'
`header_refs`. That assumed `headers_footers` was laid out as
"all headers first, then all footers" — but the parser actually
interleaves them per section (header_refs of section 0, then
footer_refs of section 0, then headers of section 1, etc.). In
multi-section documents the cumulative-count split silently
misclassified entries into the wrong column.

Record the role explicitly on each parsed `HeaderFooter` and let
the markdown renderer read it directly. Walking header_refs and
footer_refs in two separate loops at parse time keeps the role
authoritative, even when individual refs fail to resolve and don't
contribute an entry to `headers_footers`.
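
A simplified sketch of the fix, using hypothetical cut-down types: the parser tags each entry with its role while walking `header_refs` and `footer_refs` in separate loops, and the splitter reads that tag instead of inferring role from position. Reference strings stand in for resolved header/footer text here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Header,
    Footer,
}

/// Cut-down stand-ins for the parsed types; field sets are assumptions.
struct HeaderFooter {
    role: Role,
    text: String,
}

struct Section {
    header_refs: Vec<String>,
    footer_refs: Vec<String>,
}

/// Entries come out interleaved per section, each carrying its own role.
fn parse_headers_footers(sections: &[Section]) -> Vec<HeaderFooter> {
    let mut out = Vec::new();
    for section in sections {
        for r in &section.header_refs {
            out.push(HeaderFooter { role: Role::Header, text: r.clone() });
        }
        for r in &section.footer_refs {
            out.push(HeaderFooter { role: Role::Footer, text: r.clone() });
        }
    }
    out
}

/// Split by the recorded role, never by index.
fn split_headers_footers(entries: &[HeaderFooter]) -> (Vec<&str>, Vec<&str>) {
    let mut headers = Vec::new();
    let mut footers = Vec::new();
    for entry in entries {
        match entry.role {
            Role::Header => headers.push(entry.text.as_str()),
            Role::Footer => footers.push(entry.text.as_str()),
        }
    }
    (headers, footers)
}

fn main() {
    let sections = vec![
        Section { header_refs: vec!["H0".to_string()], footer_refs: vec!["F0".to_string()] },
        Section { header_refs: vec!["H1".to_string()], footer_refs: vec!["F1".to_string()] },
    ];
    let entries = parse_headers_footers(&sections);
    let (headers, footers) = split_headers_footers(&entries);
    assert_eq!(headers, vec!["H0", "H1"]);
    assert_eq!(footers, vec!["F0", "F1"]);
}
```

With the interleaved order H0, F0, H1, F1, a cumulative-count split at index 2 would have put F0 among the headers and H1 among the footers.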

Closes a Copilot review comment on PR #38.
…ants

Coverage was sitting at 73.2% line on this branch, below the 75%
floor enforced by the Code Coverage CI job. The PR introduced a lot
of new write/conversion code (chart text, embedded fonts, multi-section
sectPr, drawing anchors, page setup) and the existing integration
tests only exercised the common element variants.

Adds round-trip tests through `create_from_ir_to_writer` for:
- ThematicBreak (verifies w:pBdr emission in document.xml)
- PageBreak + ColumnBreak (verifies w:br w:type="page"/"column")
- Footnote + Endnote (verifies the footnotes.xml/endnotes.xml parts)
- TextBox (verifies floating content lands in document.xml)
- Numbered List with start_number
- Multi-section document with Continuous / NextPage / OddPage breaks
- Crate-level Document::from_reader + plain_text + to_markdown + to_ir
  convenience path

Local line coverage rises from 73.21% to 76.43%, clearing the 75%
threshold.
@yfedoseev yfedoseev merged commit 60e29f1 into main May 15, 2026
70 checks passed
@yfedoseev yfedoseev deleted the release/v0.1.2 branch May 15, 2026 01:58

2 participants