
Fix corrupt PDF crash and --global-index field parsing for auto-generated COLLECTION.md #30

Merged
ali5ter merged 1 commit into main from fix/convert-corrupt-pdf-and-global-index-fields on Apr 14, 2026

@ali5ter ali5ter commented Apr 14, 2026

Summary

  • Bug 1: Corrupt PDF crash. convert_publication() calls fitz.open(), which raises pymupdf.FileDataError on a corrupt PDF and previously aborted the entire run. The convert_publication() call is now wrapped in try/except Exception, so the offending PDF is skipped with a warning and the run continues.
  • Bug 2: --global-index misses auto-generated fields. The write_global_index() parser matched only the hand-crafted bold labels (**Period**, **Pages**), but --write-collection-md writes plain-text labels (Date range, Total pages). The two re.search regexes now match both formats via alternation, so Period and Pages populate correctly regardless of how COLLECTION.md was authored.
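
For reviewers who want to see the alternation idea from Bug 2 in isolation, here is a minimal sketch. The patterns, the `parse_fields` helper, and the assumed `Label: value` line layout are all hypothetical; the actual regexes in write_global_index() are not reproduced here and may differ.

```python
import re

# Hypothetical patterns (assumption): each accepts either the hand-crafted
# bold label or the auto-generated plain label, followed by an optional colon.
PERIOD_RE = re.compile(r"(?:\*\*Period\*\*|Date range)\s*:?\s*(.+)")
PAGES_RE = re.compile(r"(?:\*\*Pages\*\*|Total pages)\s*:?\s*([\d,]+)")


def parse_fields(text):
    """Return (period, pages) found in COLLECTION.md text; None if a field is absent."""
    period = PERIOD_RE.search(text)
    pages = PAGES_RE.search(text)
    return (
        period.group(1).strip() if period else None,
        pages.group(1) if pages else None,
    )


# Illustrative inputs only; real COLLECTION.md files may be laid out differently.
hand_crafted = "**Period**: 1975-1982\n**Pages**: 1,204\n"
auto_generated = "Date range: 1975-1982\nTotal pages: 1204\n"

print(parse_fields(hand_crafted))
print(parse_fields(auto_generated))
```

Using one pattern per field with a non-capturing alternation keeps a single capture group for the value, so the calling code does not need to know which label format matched.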

Both fixes were surfaced in the downstream instance (ali5ter/electronics-publications-library@f8f8403).

Test plan

  • Run convert.py against a directory that includes a corrupt PDF — confirm it logs WARNING: skipping <file> and continues rather than crashing
  • Run convert.py --write-collection-md to generate an auto-format COLLECTION.md, then run convert.py --global-index collections/ — confirm Period and Pages columns are populated in CATALOGUE.md
  • Run the same --global-index against a hand-crafted COLLECTION.md with **Period**/**Pages** bold fields — confirm those still populate correctly
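
The first test-plan item can also be approximated without a real corrupt PDF. The sketch below assumes a loop shaped roughly like the one in convert.py; `convert_all` and `fake_convert` are hypothetical names, and `fake_convert` stands in for convert_publication() by raising for one file.

```python
import logging
from pathlib import Path

log = logging.getLogger("convert")


def convert_all(pdf_paths, convert_one):
    """Convert each PDF, skipping any that raise; return (converted, skipped)."""
    converted, skipped = [], []
    for path in pdf_paths:
        try:
            convert_one(path)
        except Exception as exc:  # broad on purpose, as the PR description states
            log.warning("skipping %s: %s", path, exc)
            skipped.append(path)
            continue
        converted.append(path)
    return converted, skipped


def fake_convert(path):
    # Hypothetical stand-in for convert_publication(): fails for one file
    # to simulate a corrupt PDF.
    if path.name == "corrupt.pdf":
        raise ValueError("cannot open broken document")


paths = [Path("a.pdf"), Path("corrupt.pdf"), Path("b.pdf")]
converted, skipped = convert_all(paths, fake_convert)
```

The point of the check is that the file after the failing one is still processed, which is the behaviour the manual test is meant to confirm.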

🤖 Generated with Claude Code

…ated COLLECTION.md

- Wrap convert_publication() call in try/except so corrupt PDFs log a
  warning and are skipped rather than aborting the entire run
- Update --global-index regexes to match both hand-crafted bold format
  (**Period**, **Pages**) and auto-generated plain format (Date range,
  Total pages) so period/pages columns populate correctly for
  collections built with --write-collection-md

Fixes surfaced in ali5ter/electronics-publications-library commit f8f8403.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ali5ter ali5ter merged commit 2cd8f15 into main Apr 14, 2026
1 check passed
@ali5ter ali5ter deleted the fix/convert-corrupt-pdf-and-global-index-fields branch April 14, 2026 13:00