Add Markdown conversion scripts. #22

kattni · 2025-08-27T00:39:13Z

@mhsmith Here are the scripts and processes I am using, as requested in the Tutorial PR.

Converting documentation from rST to Markdown using Pandoc and a custom Python script

The Pandoc Markdown cleanup script was initially written with Toga in mind, but it applies to any Pandoc-converted reStructuredText, including the BeeWare Tutorial. There are some hard-coded Toga references in it, but those will be easy to update to work with Briefcase and Rubicon Objective-C, if and when the time comes.

To be clear, a significant portion of the specific work referenced in the Toga PR was done manually, as the point was to put together a proof of concept, not to sort out my automation. The tutorial required much less manual intervention.

I ran a specific Pandoc command (included below), which is about the best Pandoc can do in this case. Pandoc is not aware of the Python Markdown plugins I am using, so it cannot output everything exactly as needed. I therefore spent a significant amount of time early in this process going through every initially converted file in Toga and the BeeWare Tutorial, extracting test strings for every bit of Pandoc-provided syntax that needed to be updated to work with those plugins and my MkDocs configuration. I have included either the exact test strings from the documentation content, or pseudo-versions of what the strings look like, in the docstrings for most of the functions in this script. I used them for testing, and it made sense to leave them in the file to make it clear what each regex is looking for.

The script takes the rST :notation: links into account, and updates them so that they render as they did in rST; I know this was a concern from a previous rST to Markdown conversion. The Pandoc command takes something like ":class:`~toga.sources.Row`" and converts it to "`~toga.sources.Row`{.interpreted-text role="class"}", which is what the clean-up script uses to generate the updated links. The script takes the ~ into account when generating the autorefs links. That said, the Toga PR brought some quirks of rST class/method links to Russ' attention, and it turns out he would prefer that all methods be rendered appended to the class name, which is not done consistently in the current documentation; I will be updating the script to address that change.
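As an illustration of the role-link rewrite described above, here is a minimal sketch. It is not the function from the script: the name `convert_role_links` is hypothetical, and it assumes Pandoc emits each role as a backtick-quoted span followed by an `interpreted-text` attribute block, and that the desired result is an autorefs link of the form [`title`][identifier].

```python
import re

# Matches Pandoc's rendering of an rST role, for example:
#   `~toga.sources.Row`{.interpreted-text role="class"}
# The "." before "interpreted-text" and the padding inside the braces are
# optional here, so slightly different renderings are tolerated.
ROLE_RE = re.compile(
    r'`(?P<tilde>~?)(?P<target>[\w.]+)`'
    r'\{\s*\.?interpreted-text\s+role="(?P<role>\w+)"\s*\}'
)


def convert_role_links(text: str) -> str:
    """Replace Pandoc role spans with autorefs-style links (illustrative only)."""

    def _replace(match: re.Match) -> str:
        target = match.group("target")
        # In rST, a leading "~" means "display only the last component".
        title = target.rsplit(".", 1)[-1] if match.group("tilde") else target
        return f"[`{title}`][{target}]"

    return ROLE_RE.sub(_replace, text)
```

This sketch ignores the role name when building the link; rendering methods appended to their class name, as mentioned above, would need extra handling keyed on the captured role.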

A couple of things are worth noting. I discovered the autorefs linking feature after writing a couple of the functions, so they may be written awkwardly or include unnecessary link info in the docstring (as in the Python doc-links function, for example). Also, a few of the functions are basically one-offs needed for odd notations that came out of the initial Pandoc conversion of Toga at the beginning of this process, and only apply to a couple of use cases. In my thorough search through Toga, I found a couple of instances where an include was replaced with the actual text from the included file. I have addressed the instances I am aware of in the current PR; however, I will need to search the existing documentation to make sure it didn't happen elsewhere as well.

I have now run this script on both the BeeWare Tutorial and Toga. It mostly handled the tutorial; however, it missed some obvious things in Toga, which highlighted places where it needs work. This is the first time I've worked with regexes, and, given that, I'm pretty happy with how well it worked on the first run. There are a few TODOs in the script, which are either about whether to run a particular function before or after another one, or about something that needs to be addressed in the code.

After running the script, I went through every updated file in the Tutorial individually, with the intention of catching anything missed by the script, as well as cleaning up extra whitespace. I don't recall every bit of consistently missed syntax I had to fix manually. A few examples I remember:

I initially used autorefs links for the "Now go to the next page" links at the bottom of each of the tutorial pages. However, autorefs links resolve to the associated anchor, which means that following one would load the next page scrolled to the title, with the header off-screen. I updated all of those to actual page links to avoid that issue.
There was some consistent extra whitespace after the tab and admonition syntax, as well as a missed colon here and there from the Pandoc-converted tab and admonition syntax. So the regexes need to be updated there.
On the Toga run, I forgot to add backticks to the code-rendered autorefs links in the initial run, so there is also a function at the end that resolved that issue. I updated the actual link function to include the backticks, but left the followup function in case I didn't manage to catch all the places in the primary link function that needed backticks.
Also on the Toga run, some of the :notation: links were missed on the first run of the script. The function that handles them is one of the most complicated functions in the script, so I quickly wrote up a second function to grab what was left. I couldn't get the regex to work with all of them in a reasonable amount of time, so for the ones that weren't isolated and updated by it, I added a TODO note, and then searched for that note and manually updated those that were left over.
In some cases, the regex would miss improperly formatted links, which meant the beginning of one link was updated to [, and then a while later, the end of another link was updated to ]. It is relatively straightforward to reliably find any missed links by searching the files for various parts of the rST link syntax, e.g. a search for ">`__", or a regex search for a single backtick followed by text followed by a "<". I was able to identify missed links this way, and each hit was a signal to check that file for other missed links between the partially updated ones. It's unclear why rST wasn't complaining about the links to begin with, but apparently it misses things.
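The two searches mentioned in the last example can be sketched as a small checker. This is only an illustration, not part of the cleanup script: the function names are hypothetical, and the second pattern is a loose heuristic that will also flag legitimate Markdown in which inline code is followed by a "<" on the same line.

```python
import re
from pathlib import Path

# Two heuristics taken from the description above:
#   1. the literal tail of an rST external link:  >`__
#   2. a single backtick, some text, then "<" (the start of "`text <url>`__")
LEFTOVER_PATTERNS = [
    re.compile(r">`__"),
    re.compile(r"(?<!`)`(?!`)[^`<\n]+<"),
]


def find_leftover_links(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that still look like rST links."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in LEFTOVER_PATTERNS):
            hits.append((number, line))
    return hits


def scan_docs(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every Markdown file under root and collect suspicious lines."""
    results = {}
    for path in sorted(Path(root).rglob("*.md")):
        hits = find_leftover_links(path.read_text(encoding="utf-8"))
        if hits:
            results[str(path)] = hits
    return results
```

Calling `scan_docs("docs")` (the directory name is an assumption) would give a per-file list of lines to review by hand, which matches the manual verification step described above.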

Customised Pandoc command

I ran Pandoc with a modified commonmark_x output format. commonmark_x is the basic CommonMark Markdown syntax with a series of Pandoc extensions enabled by default. The following is the command. As written, it should be run in the docs directory, because it recursively converts every rST file it finds below wherever you run it. Pandoc does not recurse into directories on its own, so the find is necessary to run it on multiple files.

find -f **/*.rst -type f -exec sh -c 'pandoc -f rst "${0}" -t commonmark_x-bracketed_spans-smart-alerts -o "$(dirname {})/$(basename {} .rst).md"' {} \;

This command disables the bracketed_spans, smart, and alerts Pandoc extensions that are otherwise included with commonmark_x. They added a lot of unnecessary syntax to the Markdown, which would have created a lot of extra work. I had already written most of the conversion script before realising that I could modify the extensions included in a Pandoc command. There are a couple of other extensions that, if excluded, produce Markdown that is slightly easier to work with; however, because the script already handled the syntax they produce, it made more sense to keep them enabled. (I don't remember which extensions they were at this point.) Finally, the command outputs the converted Markdown files alongside the rST files.
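For anyone who would rather not depend on shell globbing or on a particular find implementation, the same conversion can be sketched in Python. This is an equivalent I have not run as part of this PR: it assumes pandoc is on the PATH, and the function names are hypothetical. The format string is copied from the command above.

```python
import subprocess
from pathlib import Path

# Output format from the command above: commonmark_x with the
# bracketed_spans, smart, and alerts extensions disabled.
OUTPUT_FORMAT = "commonmark_x-bracketed_spans-smart-alerts"


def build_pandoc_command(rst_path: Path) -> list[str]:
    """Build the Pandoc invocation for one rST file.

    The Markdown output is written alongside the source file.
    """
    return [
        "pandoc",
        "-f", "rst",
        str(rst_path),
        "-t", OUTPUT_FORMAT,
        "-o", str(rst_path.with_suffix(".md")),
    ]


def convert_tree(root: str = ".") -> None:
    """Convert every .rst file under root, recursing into subdirectories."""
    for rst_path in sorted(Path(root).rglob("*.rst")):
        # check=True stops at the first file Pandoc fails to convert.
        subprocess.run(build_pandoc_command(rst_path), check=True)
```

Running `convert_tree("docs")` from the repository root would limit the conversion to the docs directory. Passing the arguments as a list also avoids quoting problems with file names that contain spaces.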

Translation conversion

The translations were converted from rST formatting to Markdown by a separate script, also included here. Once the rST PO files were converted, I was able to generate a set of PO template files from the English Markdown, and use a translation tool called pomerge to merge the now-Markdown-formatted PO files into a new set of files that had the proper file locations with each msg* string pair. A few strings were lost due to major syntactical changes, but the majority of the translations were preserved, including the latest additions to the German translation.

The translation conversion script ran into some problems, because it turns out the translations are loaded with improperly formatted links. Link issues I encountered include: links wrapped in double quotes instead of backticks, links with spaces in them, links missing one of the backticks, links missing both backticks, links followed by only one underscore, and more that I can't remember. I went through every PO file, searched for link formatting issues, and resolved them manually.
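A search for the link problems listed above could be partly automated along these lines. This is a hedged sketch, not part of the translation script: the function name is hypothetical, it works on a plain string rather than parsing PO files, it assumes the expected form is `text <url>`__ with two trailing underscores, and it only considers http(s) targets.

```python
import re

# Any "<http...>" target on one line. Spaces are allowed inside so that
# targets with stray spaces are still found and reported.
TARGET_RE = re.compile(r"<https?://[^>\n]*>")
# The form treated as well-formed here: `text <url>`__
WELL_FORMED_RE = re.compile(r"`[^`<]+ <https?://[^>\s]+>`__")


def find_malformed_links(msg: str) -> list[str]:
    """Return URL targets in msg that are not part of a well-formed link."""
    good_spans = [m.span() for m in WELL_FORMED_RE.finditer(msg)]
    bad = []
    for match in TARGET_RE.finditer(msg):
        inside_good = any(
            start <= match.start() and match.end() <= end
            for start, end in good_spans
        )
        if not inside_good:
            bad.append(match.group(0))
    return bad
```

Each reported target still needs a human decision, since the sketch only says that something around the URL is off, not which of the listed problems it is.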

Docstring conversion

I have the early skeleton of a docstring conversion script. The docstrings in the source code for module documentation will need to be updated to use Markdown syntax for links etc. to render properly. I mostly did this manually for the Toga PR on the few files that I updated for the proof of concept, though I was able to run a single function to convert most of the rST links to Markdown autorefs format. I am not including it here, as it is in the very early stages of development. I have not prioritised it yet, as a final decision hasn't been made regarding the Toga etc. switch to MkDocs.

PR Checklist:

All new features have been tested
All new features have been documented
I have read the CONTRIBUTING.md file
I will abide by the code of conduct

kattni · 2025-08-28T23:08:53Z

This should remain unmerged for now as the Pandoc script will likely be receiving updates.

Add Markdown conversion scripts.

4f348fb

kattni mentioned this pull request Aug 27, 2025

Shift BeeWare tutorial to MkDocs beeware/beeware-tutorial#3

Merged

4 tasks

kattni marked this pull request as draft August 28, 2025 23:07
