Skip to content

Fix index generation for Sitemapper.stream#1

Open
newtrat wants to merge 3 commits into
masterfrom
fix-streamer-index-generation
Open

Fix index generation for Sitemapper.stream#1
newtrat wants to merge 3 commits into
masterfrom
fix-streamer-index-generation

Conversation

@newtrat
Copy link
Copy Markdown
Owner

@newtrat newtrat commented Apr 12, 2023

⚠️ WIP ⚠️ Needs tests and possibly test fixes as well!

Description

Re-organizes Builder and Streamer into InMemoryBuilder and StreamBuilder respectively.

Updates file generation for streaming so that:

  • The last file includes a page number, instead of just being named sitemap.xml
  • The index includes all generated pages

Also removes the encoding option passed when generating XML files, because:

  • libxml2 was occasionally crashing inside xmlFindCharEncodingHandler, and I am not equipped to fix that bug
  • We were using UTF-8, which is already libxml2's default, meaning we don't need to pass it in.

Motivation and Context

Streamer was working fine for generating sitemap files themselves, but it didn't interact well with Builder's generate and generate_index methods. The last generated sitemap would have no page number, and the generated index only knew about that single, last page. Modifying these to work for streamed sitemaps reduced Builder's and Streamer's overlap enough that I thought they belonged as two subclasses of a single base class, rather than having Streamer as a subclass of Builder. Their shared functionality is now inside Builder, and they themselves have been renamed StreamBuilder and InMemoryBuilder respectively.

I don't have a minimal reproduction ready yet for demonstrating the libxml2 crash, because I'm using sitemapper inside a private repo. However, I could definitely make one and report it as an issue if you'd like. It seems to happen only when generating the second or later pages of large sitemaps.

newtrat added 3 commits April 12, 2023 18:39
Re-organizes Builder and Streamer into InMemoryBuilder and
StreamBuilder respectively

Updates file generation for streaming so that:
- The last file includes a page number, instead of just being named
  sitemap.xml
- The index includes all generated pages

Needs tests and possibly test fixes as well
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant