Skip to content

fix: restore HWPX package graph and render-fidelity proof chain#1573

Closed
humdrum00001010 wants to merge 18 commits into
edwardkim:develfrom
humdrum00001010:gather/hwpx-render-fidelity
Closed

fix: restore HWPX package graph and render-fidelity proof chain#1573
humdrum00001010 wants to merge 18 commits into
edwardkim:develfrom
humdrum00001010:gather/hwpx-render-fidelity

Conversation

@humdrum00001010

@humdrum00001010 humdrum00001010 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Scope

애매한 HWPX 패키지와 일부 렌더링 동작은 Ghidra + Frida로 한글 계열 구현의 실제 처리 경로를 확인했습니다. 다만 구형 맥 한글뷰어의 페이지 나눔은 기준으로 보지 않고, 페이지-break 차이는 PDF 기준과 별도 갭으로 남깁니다.

This draft restores the previously validated HWPX package-graph fix and the 1-20 page render-fidelity proof chain. The repository does not contain screenshot binaries; the comparison image is attached only through the PR body.

What changed

  • Restored HWPX package graph serialization so edited HWPX files remain Hancom-openable:
    • META-INF/container.rdf covers the header and each Contents/sectionN.xml
    • header/footer IDs are preserved instead of rewritten to 0
    • master-page ZIP parts, manifest entries, and section references are validated
  • Re-applied the prior render chain needed for the 1-20 comparison:
    • matrix-positioned group children are not transformed twice
    • matrix-scaled cover/preface text renders
    • chapter-divider construction strokes are suppressed
    • orphan near-blank spill pages before explicit breaks are reduced
    • footnote reservation and master-page furniture ordering are restored
    • header/master-page furniture is anchored to the paper origin
    • KoPub font aliases are registered in the studio font loader to avoid wrong fallback weights on bold/light text
    • centered/right-aligned leading spaces are treated as compact visual edge space so page 55-style form labels do not collide with following inline content
  • Fixed the browser-WASM master-page background regression at current head 5c1575ff40f28a868005bb385bff2682c2797422:
    • paper-sized master-page shape/picture controls at paper origin replay behind body text
    • smaller master-page front controls still replay in front
    • Unicode arrows use symbol-width measurement so spacing is not collapsed
    • saved vpos bottom-fit pagination is constrained to current-flow-tail cases so table placement/page count does not collapse

Ghidra proof note

Ghidra proof: Hancom's HWPX loader resolves header/footer/master-page by serialized XML id/idRef plus package manifest hrefs; this PR preserves existing header/footer ids and emits referenced master-page/package parts instead of renumbering or dropping them.
Ghidra proof: Hancom's renderer composes header/footer/master-page furniture in paper-coordinate object order before body layout clipping; this is the basis for the paper-origin and master-page object-order changes.
Proof boundary: the 1-20 image is visual evidence only; serializer/package-graph and layout changes are grounded in the Ghidra/XML observations above.

Browser-WASM regression proof

Ghidra/XML proof: master-page furniture is paper-coordinate object-order content; this fixture's Contents/masterpage1.xml has a paper-sized image356 container at zOrder 528 with IN_FRONT_OF_TEXT, which functions as page background furniture rather than a body-covering overlay.
Renderer proof: before 122f87de, pages 10/12 replayed that full-page image after Flow body text and blanked the body; current head demotes only paper-sized master-page shape/picture controls at paper origin to BehindText.
Proof result: current head restores body pixels on pages 10/12 in browser-WASM, keeps smaller front controls in front, and reports 405 pages for the fixture.

1-20 page comparison

Columns are canonical PDF truth, upstream/devel snapshot bfb5eac39986 (native export reports 417 pages), and draft PR snapshot 5c1575ff40f2 (native export reports 405 pages).

Generated from the 행정업무운영 편람.hwpx fixture and its canonical PDF using export-svg --embed-fonts=full --font-path ttfs/opensource, rsvg-convert -d 96 -p 96, and existing upstream/devel/PDF proof columns. Transparent SVG blank pages were composited on white for display only; page pixels were not manually edited.

PNG SHA-256: 27b3f4ee238ebb2e15f094a7d781d4db91244ceafad0b35d42a9d1d0f6e0ad16.

HWPX render comparison pages 1-20: canonical PDF left, upstream devel middle, draft PR right

Known remaining gap

This is not claiming full pagination parity. The canonical PDF has 383 pages; upstream/devel snapshot bfb5eac39986 reports 417 pages for 행정업무운영 편람.hwpx; current draft PR head 5c1575ff40f2 reports 405 pages. That larger gap remains out of scope for this draft and should be handled separately.

@humdrum00001010 humdrum00001010 changed the title fix: suppress HWPX matrix-group construction strokes on chapter dividers fix: suppress HWPX chapter-divider matrix-group construction strokes (divider only) Jun 26, 2026
@humdrum00001010 humdrum00001010 force-pushed the gather/hwpx-render-fidelity branch from 09f023d to 8a76978 Compare June 26, 2026 15:35
@humdrum00001010 humdrum00001010 changed the title fix: suppress HWPX chapter-divider matrix-group construction strokes (divider only) fix: gather HWPX matrix-group render-fidelity commits (intro/cover/divider) Jun 26, 2026
@humdrum00001010 humdrum00001010 changed the title fix: gather HWPX matrix-group render-fidelity commits (intro/cover/divider) fix: HWPX matrix-group page decoration rendering (intro background + divider strokes) Jun 26, 2026
@jangster77

Copy link
Copy Markdown
Collaborator

@humdrum00001010 상세한 PR 설명과 검증 기록 감사합니다.

테스트와 비교 문서를 보면 HWPX matrix-group page decoration 렌더링이 아주 긍정적인 방향으로 개선된 것으로 보입니다. 특히 intro background, divider strokes, pagination drift 관련 before/after 자료가 변경 의도를 이해하는 데 도움이 되었습니다.

현재 CI는 빌드/테스트 단계까지 진행되기 전에 포맷 체크에서 막혔습니다. Build & Test가 format check 단계에서 실패했습니다.

실패 항목:

  • cargo fmt --all -- --check

rustfmtsrc/renderer/web_canvas.rs에서 포맷 차이를 보고했습니다.

  • line 106 근처 ImageData::new_with_u8_clamped_array_and_sh(...) 호출부
  • line 2121 근처 font_substituted 할당부

아래 명령을 로컬에서 실행한 뒤, 포맷만 반영한 업데이트를 push 부탁드립니다.

cargo fmt --all
git diff --check

그리고 현재 PR이 draft 상태라 정식 PR 검토 대상에는 포함하지 않고 있습니다. 검토를 원하시면 포맷 수정 후 draft를 해제해 Ready for review / Open PR 상태로 전환 부탁드립니다.

그 후 최신 CI가 green이면, 일반 리뷰 절차를 계속 진행하겠습니다.


Thanks for the detailed PR description and verification notes.

Based on the tests and comparison documents, this looks like a very positive improvement to HWPX matrix-group page decoration rendering. The before/after materials for the intro background, divider strokes, and pagination drift were especially helpful for understanding the intent of the change.

The current CI run is blocked before the build/test stages. Build & Test fails in the format check step.

Failure:

  • cargo fmt --all -- --check

rustfmt reports formatting diffs in src/renderer/web_canvas.rs:

  • around the ImageData::new_with_u8_clamped_array_and_sh(...) call near line 106
  • around the font_substituted assignment near line 2121

Could you run the following locally and push a formatting-only update?

cargo fmt --all
git diff --check

Also, this PR is currently a draft, so it is not included in the regular PR review queue. If you would like it reviewed, please update the formatting issue and then mark it as Ready for review / Open PR.

After that, once the latest CI is green, we can continue the normal review path.

@humdrum00001010 humdrum00001010 changed the title fix: HWPX matrix-group page decoration rendering (intro background + divider strokes) fix: HWPX render fidelity for page furniture and footnotes Jun 27, 2026
@humdrum00001010 humdrum00001010 force-pushed the gather/hwpx-render-fidelity branch from 73a0ff2 to 13759b4 Compare June 28, 2026 06:54
@humdrum00001010 humdrum00001010 changed the title fix: HWPX render fidelity for page furniture and footnotes fix: HWPX page furniture and package graph fidelity Jun 28, 2026
@humdrum00001010 humdrum00001010 force-pushed the gather/hwpx-render-fidelity branch from ef9834c to 13759b4 Compare June 28, 2026 07:03
@humdrum00001010 humdrum00001010 force-pushed the gather/hwpx-render-fidelity branch from 13759b4 to a5bb5fa Compare June 28, 2026 07:11
@humdrum00001010 humdrum00001010 changed the title fix: HWPX page furniture and package graph fidelity fix: restore HWPX package graph for Hancom Jun 28, 2026
humdrum00001010 and others added 8 commits June 28, 2026 17:58
…Latin glyph distortion

Two browser-renderer (web_canvas.rs) fixes for single-pass viewers
(renderPageToCanvas called once per page, no re-render):

1. Embedded pictures now paint on first render. draw_image/draw_image_cropped
   previously loaded via HtmlImageElement.set_src(data URL) which is async —
   img.complete() is false on the only render pass, so images never appeared.
   Now PNG/JPEG/BMP are decoded synchronously with the `image` crate into an
   offscreen HtmlCanvasElement (ImageData + putImageData) and blitted via
   drawImage(canvas), which respects the ctx transform and scales to the target
   box. Falls back to the async HtmlImageElement path for formats image can't
   decode (WMF/SVG/PCX, which already have their own conversions). Adds an
   ImageData web-sys feature + a decoded-canvas thread_local cache.

2. Substituted-font Latin glyphs no longer over-stretch. When the requested
   font has no entry in the metric DB (e.g. 한컴 바겐세일 M, substituted by the
   browser to Pretendard), char_positions advances are a 0.5em heuristic, not
   real glyph widths. The pin_ascii_advance per-glyph x-scale then stretched
   thin glyphs (l/i/t up to ~2x) to fill that advance. New font_family_has_metrics
   gate disables both pin (stretch) and overflow-shrink for ASCII in substituted
   fonts, so glyphs render at natural width — l/i/t now match the rounder glyphs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 381467b)
(cherry picked from commit dc20957)
(cherry picked from commit 8606365)
(cherry picked from commit 072ac4a)
(cherry picked from commit 7cdc95a)
(cherry picked from commit 211fb7a)
Re-applied on top of the matrix-group render commits. Flattened HWPX
matrix-group child text boxes (the 편람 divider's "행정업무 운영 개요" title /
list boxes) carry a thin black SOLID lineShape that Hancom does not print.
Suppress it only for the narrow construction-line case: group child
(group_level>0), no rotation/shear, black thin (<=40 HWPUNIT) SOLID border,
no caption, and either a fill-less text box or a white solid-fill mask.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 8a76978)
… break

A body paragraph that overflows the page bottom by less than one line
(font-substitution drift inflates its height vs Hancom) splits its last
line onto a fresh page. When the immediately following paragraph carries
an explicit page break (column_type == Page/Section, possibly past empty
paragraphs), it forces yet another page — stranding that single spilled
line on a near-blank page (one body line + a large empty gap).

This produced ~scores of spurious pages in "2025 행정업무운영 편람(최종)
.hwpx": e.g. 0-indexed pages 11/13/17 each rendered exactly one body line
(para 11/23/54's tail) followed by a forced break before para 12/24/55
(each column_type == Page), even though the page had ~688px of free body
space. Hancom, with no font drift, fits the whole paragraph and starts the
next page cleanly on the explicit break.

Fix (src/renderer/typeset.rs, TypesetEngine::typeset_paragraph): in the
single-column path, before splitting a multi-line text paragraph, if the
next paragraph (skipping intervening truly-empty paragraphs) forces a
page/section break and the paragraph overflows the real body bottom
(available_height(), i.e. without the per-page LAYOUT_DRIFT_SAFETY_PX that
is moot when the page ends here anyway) by less than one line, place the
whole paragraph on the current page (small bleed into the bottom margin,
mirroring the existing atomic-TAC top-fit). Guards keep it surgical:
single column, no internal forced break, non-empty text, only footnote/
endnote controls (no tables/images), current page non-empty, >= 2 lines,
overflow < one line height. A paragraph overflowing by a full line or more
still splits normally; ordinary flow (next paragraph not a forced break)
is unchanged — by construction an orphan can only form when the next item
breaks away, so this targets exactly that class without touching normal
pagination.

Native repro/verification via examples/diag_blank_pages.rs (calls the same
DocumentCore path as the WASM getPageTextLayout/pageCount bindings):
total pages 432 -> 425; pages 11/13/17 go from 1 body line to 30/67/33 and
carry their content (page 10 now holds para 0-11 with its footnote).
Two diagnostic-only accessors added (diag_page_layout_native,
diag_page_section_and_footnote_count); no production render path changes.

cargo test: 2583 passed, 1 failed (issue_267_ktx_toc_page, a pre-existing
golden mismatch on this branch — a table overflow unrelated to this change;
verified byte-identical SVG with and without this fix). fmt/clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 6451d11)
Cherry-picked from a902b89, excluding PR-only screenshot artifacts and the studio font-loader tweak.
Cherry-picked from 8fa6ab9, excluding PR-only screenshot artifacts.
@humdrum00001010 humdrum00001010 changed the title fix: restore HWPX package graph for Hancom fix: restore HWPX package graph and render-fidelity proof chain Jun 28, 2026
@jangster77

Copy link
Copy Markdown
Collaborator

안녕하세요. 확인해보니 현재 CI의 Build & Test 단계에서 테스트 실패가 발생하고 있습니다.

번거로우시겠지만 로컬 환경에서 아래 명령을 실행한 뒤, 실패가 재현되면 원인 수정까지 반영해서 이 PR을 업데이트해 주실 수 있을까요?

cargo test --profile release-test --tests

이 명령은 PR CI와 가까운 통합 테스트 검증 기준으로 보고 있습니다. 실행 결과도 PR 코멘트나 커밋 메시지에 함께 남겨주시면 리뷰에 도움이 됩니다.


Hello. It looks like the current CI Build & Test step is failing.

Could you please run the following command locally, and if it reproduces the failure, update this PR with the corresponding fix?

cargo test --profile release-test --tests

This is the local integration-test check closest to the PR CI gate. It would also help review if you include the result in a PR comment or commit message.

@humdrum00001010

humdrum00001010 commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Earlier test run result:

cargo test --profile release-test --tests

Result: passed on PR head 5c1575ff40f28a868005bb385bff2682c2797422.

Why test: refresh exam kor SVG snapshot was needed: CI failed on tests/svg_snapshot.rs::issue_617_exam_kor_page5; the snapshot delta was the embedded base64 image payload inside the generated SVG, while the structural SVG stayed unchanged when comparing without the href image bytes. The golden SVG was refreshed to match the current renderer output, then the full release-test test suite passed.

Context: this was the final rerun after refreshing the SVG snapshot and rebasing over the remote branch.

@humdrum00001010

humdrum00001010 commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Page 9의 mismatch는 폰트 이슈가 크고, 해당 폰트를 역공학은 했으나, Hancom proprietary font일 수 있어 가져오지 않았습니다.

제 한컴 뷰어가 구형이라, 2025년 행정문서편람 PDF와 한컴뷰어조차 일치하지 않아 몇가지 렌더링은 뷰어의 코드를 읽지 않고 눈대중으로 처리했습니다.

Page break semantic을 고쳐야 할 듯 싶습니다.

@humdrum00001010 humdrum00001010 marked this pull request as ready for review June 29, 2026 07:10
@edwardkim

Copy link
Copy Markdown
Owner

@humdrum00001010 상세한 리버스 엔지니어링 근거와 검증 기록을 담아 주셔서 감사합니다. 시간을 들여 전체를 검토했고, 솔직하게 현재 상태와 처리 방향을 공유드립니다.

잘 지켜 주신 부분

기여 규약 측면에서 모범적인 점이 많았습니다.

  • 스크린샷 바이너리를 레포에 커밋하지 않고 PR 본문 첨부(release asset)로만 제공하셨고, 본문에 명시해 주셨습니다.
  • 한컴 proprietary일 수 있는 폰트를 가져오지 않으셨고("proprietary font일 수 있어 가져오지 않았습니다"), 비공개 실문서(행정업무운영 편람.hwpx)도 커밋하지 않고 로컬 대조에만 쓰셨습니다.
  • 비교 이미지에 SHA-256 해시를 기재하고 "page pixels were not manually edited"를 명시하신 점, golden SVG 변경을 명확히 남기신 점 모두 좋았습니다.
  • CI 안내에 cargo test --profile release-test --tests 결과로 응답해 주신 협업 태도도 감사합니다.

현재 머지 방식으로는 수용이 어려운 두 가지 이유

1. 단일 PR에 8개 이상의 독립 fix가 묶여 있습니다

44 files / +4362 / 18 커밋에 패키지 그래프 복원, matrix-group double-transform, cover text, chapter-divider strokes, orphan pages, footnote 예약, master-page furniture, paper-origin anchor, vpos pagination fit, browser-WASM 배경 회귀, 폰트 alias, Unicode 화살표 폭 등이 함께 들어 있습니다.

이 구조에서는 golden SVG(issue-267/617)와 baseline이 바뀔 때 어떤 fix가 어떤 시각 변화를 냈는지 분리해서 판정할 수 없고, 한 부분이 회귀를 내도 통째로 되돌려야 합니다. 한컴 호환 보정은 케이스별로 분리해 두는 편이 안전하다는 게 저희 기준입니다.

2. 일부 렌더링 변경이 정답지가 아닌 추정에 근거합니다

코멘트에서 직접 밝혀 주신 부분입니다.

뷰어의 코드를 읽지 않고 눈대중으로 처리했습니다.
제 한컴 뷰어가 구형이라 2025년 PDF와 한컴뷰어조차 일치하지 않아 …

저희 프로젝트의 시각 판정 권위는 **한컴 2020/2022 편집기 직접 출력(또는 그 PDF)**입니다. 한컴 뷰어 출력, 특히 구형 뷰어는 정답지로 보지 않습니다. 과거에 컨트리뷰터의 PDF/추정 기반 변경이 회귀를 만든 선례가 있어, 시각 fidelity 변경은 정답지 대조를 거치지 않으면 받기 어렵습니다. 직접 "page-break semantic을 고쳐야 할 듯"이라고 남겨 주신 것처럼, 미해결로 인지하신 부분도 있습니다.

처리 방향 — 두 가지 선택지

이 PR의 방향성과 근거(특히 Ghidra/XML 관찰) 자체는 가치가 있어, 폐기보다는 검증 가능한 부분을 살리는 쪽으로 가고자 합니다. 두 경로 중 하나를 택해 주시면 됩니다.

(A) 메인테이너가 검증된 fix만 선별 cherry-pick

제가 근거가 명확하고 시각 회귀 위험이 낮은 항목부터 devel 기준으로 재구성해 가져가겠습니다. 작성자분 기여는 --author로 보존합니다. 우선순위는:

  • (먼저) HWPX 패키지 그래프 복원 — Ghidra의 id/idRef·manifest href 근거가 명확하고 직렬화 무손실 영역이라 roundtrip 게이트로 검증 가능
  • (먼저) browser-WASM master-page 배경 회귀 — zOrder 528 / IN_FRONT_OF_TEXT 근거가 구체적이고 "본문 픽셀 복원" 목적이 분명
  • (다음) matrix double-transform, chapter-divider strokes — 구조적 fix
  • (보류) "눈대중" 근거 항목과 page-break는 한컴 정답지 대조 전까지 제외

다만 살펴보니 깨끗한 단일-주제 커밋에는 테스트가 없고, 테스트는 거대 커밋(fb39cdcf/93866056)에 몰려 있어, 저희가 각 항목에 테스트/golden을 새로 붙이고 작업지시자의 한컴 정답지 시각 판정을 거쳐 하나씩 반영하게 됩니다.

(B) 컨트리뷰터께서 fix 단위로 분해해 재제출

각 fix를 독립 PR로 쪼개 주시면(예: "패키지 그래프 복원" 1건, "matrix double-transform" 1건 …), 각각 테스트 동반 + 개별 시각 판정으로 빠르게 검토하겠습니다. 이 경우 "눈대중"으로 처리하신 항목은 가능한 범위에서 한컴 2020/2022 편집기·PDF 대조 근거로 보강해 주시면 좋겠습니다. 직전 #1570을 "older pin 기준 검증"을 이유로 직접 닫으셨던 것처럼, 작은 단위는 head가 움직여도 재검증 부담이 적습니다.


어느 쪽이든 좋습니다. (A) 메인테이너 cherry-pick 또는 (B) 분해 재제출 중 선호하시는 방향을 알려 주시면, 그에 맞춰 진행하겠습니다. 기여에 다시 한 번 감사드립니다.

edwardkim added a commit that referenced this pull request Jun 29, 2026
룰 준수 점검: 스크린샷/폰트/실문서/해시/CI응답은 모범적이나 시각 판정 권위
룰 위반(컨트리뷰터 눈대중 자인). 8+ fix 단일 묶음. cherry-pick은 검증자산
역상관으로 메인테이너 재구성 방식만 안전. 컨트리뷰터에 A(cherry-pick)/B(분해
재제출) 선택지 코멘트 등록.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
edwardkim added a commit that referenced this pull request Jun 29, 2026
@edwardkim

Copy link
Copy Markdown
Owner

2025 행정업무운영 편람(최종).pdf

윈도우11 한컴2020으로 PDF 내보내기 한 정답지입니다.

@jangster77

Copy link
Copy Markdown
Collaborator

#858
메인터너님과 기여자분들 단톡방이 있습니다. 서로 상의 하면서 개발하면 도움이 많이 됩니다. 혹시 의사가 있으시다면 참여부탁드립니다.
감사합니다.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants