Add Markdown conversion scripts. #22
Draft
@mhsmith Here are the scripts and processes I am using, as requested in the Tutorial PR.
Converting documentation from rST to Markdown using Pandoc and a custom Python script
The Pandoc Markdown cleanup script was written initially with Toga in mind, but it applies to any Pandoc-converted reStructuredText, including the BeeWare Tutorial. There are some hard-coded `toga` references in it, but those will be easy to update to work with Briefcase and Rubicon Objective-C, when/if the time comes.

To be clear, a significant portion of the specific work referenced in the Toga PR was manually done, as the point was to put together a proof of concept, not sort out my automation. The tutorial required much less manual intervention.
I ran a specific Pandoc command (included below) which is basically the best Pandoc can do in this case. Pandoc is not aware of the Python Markdown plugins I am using, and therefore cannot output everything exactly as needed. Therefore, I spent a significant amount of time early in this process going through every individual initially converted file in Toga and the BeeWare Tutorial, and extracted test strings for every bit of Pandoc-provided syntax that needed to be updated to work with the Python Markdown plugins I am using and my MkDocs configuration. I have included either the exact test strings from the documentation content, or pseudo-versions of what the strings look like, in the docstrings for most of the functions included in this script. I used them for testing, and it made sense to leave them in the file to make it clear what the regex is looking for.
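For anyone who prefers to drive the Pandoc step from Python, here is a minimal sketch of a batch conversion. It is not the command used in this PR. It assumes the `commonmark_x` output format with the `bracketed_spans`, `smart`, and `alerts` extensions disabled (as described in the Pandoc section of this description), and it writes each `.md` file next to its `.rst` source. The names `OUTPUT_FORMAT`, `build_pandoc_command`, and `convert_tree` are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumption: format string assembled from the extensions named in this PR.
# Older Pandoc releases may not recognise the "alerts" extension.
OUTPUT_FORMAT = "commonmark_x-bracketed_spans-smart-alerts"


def build_pandoc_command(rst_path: Path) -> list[str]:
    """Return the pandoc argv that converts one rST file to a sibling .md file."""
    md_path = rst_path.with_suffix(".md")
    return [
        "pandoc",
        "--from", "rst",
        "--to", OUTPUT_FORMAT,
        "--output", str(md_path),
        str(rst_path),
    ]


def convert_tree(root: Path, dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally run) one pandoc command per .rst file under root."""
    commands = [build_pandoc_command(p) for p in sorted(root.rglob("*.rst"))]
    if not dry_run:
        for command in commands:
            # check=True stops at the first file Pandoc fails to convert.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in convert_tree(Path("."), dry_run=True):
        print(" ".join(command))
```

Running it with `dry_run=True` first makes it easy to confirm which files would be touched before any Markdown is written.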
The script takes the rST `:notation:` links into account, and updates them accordingly to render as they did in rST; I know this was a concern from a previous rST to Markdown conversion. The Pandoc command takes something like ":class:~toga.sources.Row" and converts it to "~toga.sources.Row{ interpreted-text role="class" }", which is what the clean-up script uses to generate the updated links. The script takes the `~` into account when generating the autorefs links, though the Toga PR brought some quirks of rST class/method links to Russ' attention, and it turns out he would prefer all methods be rendered appended to the class name, which is not consistent through the current documentation; I will be updating the script to address that change.

Worth noting a couple of things. I discovered the autorefs linking feature after writing a couple of the functions, so they may be written awkwardly or include unnecessary link info in the docstring (as in the Python doc-links function, for example). As well, a few of the functions are basically one-offs needed for odd notations that came out of the initial Pandoc conversion of Toga at the beginning of this process, and only apply to a couple of use cases. In my thorough search through Toga, I found that there were a couple of instances where an `include` was replaced with the actual text from the included file; I have addressed the instances I am aware of in the current PR, however I will need to search the existing documentation to make sure it didn't happen elsewhere as well.

I have now run this script on both the BeeWare Tutorial and Toga. It mostly handled the tutorial; however, it missed some obvious things in Toga, which highlighted places where it needs work. This is the first time I've worked with regexes, and, given that, I'm pretty happy with how well it worked on the first run. There are a few TODOs in the script that are either about when to run a particular function before or after another one, or something that needs to be addressed with the code.
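To illustrate the `:notation:` link handling described above, here is a minimal sketch of how a Pandoc-converted role could be turned into an autorefs-style link. It is not the function from the cleanup script. The pattern is an assumption about the shape of Pandoc's output (it tolerates optional backticks around the target and an optional leading `.` inside the braces), and the names `ROLE_PATTERN`, `role_to_autorefs`, and `convert_roles` are hypothetical. Wrapping the label in backticks is also an assumption about the desired rendering.

```python
import re

# Assumed shape of Pandoc's output for an rST role, for example:
#   `~toga.sources.Row`{.interpreted-text role="class"}
# The backticks and the leading "." are treated as optional.
ROLE_PATTERN = re.compile(
    r'`?(?P<tilde>~?)(?P<target>[\w.]+)`?'
    r'\{\s*\.?interpreted-text\s+role="(?P<role>\w+)"\s*\}'
)


def role_to_autorefs(match: re.Match) -> str:
    """Build a [label][identifier] link from one matched role."""
    target = match.group("target")
    # In rST, a leading "~" displays only the last component of the path.
    label = target.rsplit(".", 1)[-1] if match.group("tilde") else target
    return f"[`{label}`][{target}]"


def convert_roles(text: str) -> str:
    """Replace every matched role in the text with an autorefs-style link."""
    return ROLE_PATTERN.sub(role_to_autorefs, text)


if __name__ == "__main__":
    sample = 'See `~toga.sources.Row`{.interpreted-text role="class"} for details.'
    print(convert_roles(sample))
    # → See [`Row`][toga.sources.Row] for details.
```

The `role` group is captured but unused here; a fuller version could use it to treat methods differently from classes, for example to prefix the class name as discussed above.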
After running the script, I went through every updated file individually in the Tutorial with the intention of catching anything missed by the script, as well as cleaning up extra whitespace. I don't recall every bit of consistently missed syntax I had to manually fix. A few examples I remember:

- `:notation:` links were missed on the first run of the script. The function that handles them is one of the most complicated functions in the script, so I quickly wrote up a second function to grab what was left. I couldn't get the regex to work with all of them in a reasonable amount of time, so for the ones that weren't isolated and updated by it, I added a `TODO` note, and then searched for that and manually updated those that were left over.
- Some links were only partially updated: the start of one link was updated to `[`, and then a while later, the end of another link was updated to `]`. It is relatively straightforward to reliably find any missed links by searching the files for various parts of the rST link syntax, e.g. search for ">`__", or a regex search that searches for a single backtick followed by text followed by a "<". I was able to identify missed links this way, and it was a signal to verify that file for other missed links between the partially updated links. It's unclear why rST wasn't complaining about the links to begin with, but apparently it misses things.

Customised Pandoc command
I ran Pandoc with a modified `commonmark_x` output format, which is the basic CommonMark Markdown syntax, with a series of Pandoc extensions enabled by default. The following is the command. As it is written, it should be run in the `docs` directory, or it will convert any rST files it finds recursively from wherever you run it. Pandoc does not work recursively natively, so the `find` is necessary to run it on multiple files.

This command disables the `bracketed_spans`, `smart`, and `alerts` Pandoc extensions that are otherwise included with `commonmark_x`. They added a lot of unnecessary syntax to the Markdown that would have made a lot of extra work. I had already written the script before realising that I could modify the extensions included in a Pandoc command. There are a couple of other extensions that, if excluded, produce Markdown that is slightly easier to work with; however, as I had already written most of the conversion script by the time I figured out the situation with disabling extensions, it made more sense to keep them included, since I had already addressed the syntax they produce. (I don't remember which extensions they were at this point.) Finally, the command outputs the converted Markdown files alongside the rST files.

Translation conversion
The translations were converted from rST formatting to Markdown by a separate script, also included here. Once the rST PO files were converted, I was able to generate a set of PO template files from the English Markdown, and use a translation tool called `pomerge` to merge the now-Markdown-formatted PO files into a new set of files that had the proper file locations with each `msg*` string pair. A few strings were lost due to major syntactical changes, but the majority of the translations were preserved, including the latest additions to the German translation.

The translation conversion script ran into some problems because it turns out the translations are loaded with improperly formatted links. Link issues I encountered: links wrapped in double quotes instead of backticks, links with spaces in them, links missing one of the backticks, links missing both backticks, links followed by only one underscore, and more I can't remember. I went through every PO file and searched for link formatting issues and resolved them manually.
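The kind of search described earlier for missed links (look for ">`__", or for a single backtick followed by text followed by a "<") can be scripted so it runs over converted Markdown and PO files in one pass. The following is a rough sketch, not part of the included scripts; the names `LEFTOVER_PATTERNS`, `find_leftover_links`, and `scan_tree` are hypothetical, and the file suffixes are an assumption.

```python
import re
from pathlib import Path

# Fragments of rST external-link syntax, e.g. `text <https://example.com>`__
LEFTOVER_PATTERNS = {
    # ">" followed by a backtick and one or two underscores
    "link close": re.compile(r">`_{1,2}"),
    # a backtick, then text, then "<" on the same line
    "link open": re.compile(r"`[^`<\n]+<"),
}


def find_leftover_links(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, pattern name, stripped line) for each hit."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in LEFTOVER_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name, line.strip()))
    return hits


def scan_tree(root: Path, suffixes=(".md", ".po")) -> dict[Path, list]:
    """Scan files under root and report only those with suspicious lines."""
    results = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            hits = find_leftover_links(path.read_text(encoding="utf-8"))
            if hits:
                results[path] = hits
    return results


if __name__ == "__main__":
    for path, hits in scan_tree(Path(".")).items():
        for number, name, line in hits:
            print(f"{path}:{number}: {name}: {line}")
```

Expect some false positives (Markdown inline code containing a "<", for example), so the output is a list of places to check by hand rather than a list of confirmed errors. Links that are missing both backticks will not match either pattern and still need a separate search.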
Docstring conversion
I have the early skeleton of a docstring conversion script. The docstrings in the source code for module documentation will need to be updated to use Markdown syntax for links etc. to render properly. I mostly did this manually for the Toga PR on the few files that I updated for the proof of concept, though I was able to run a single function to convert most of the rST links to Markdown autorefs format. I am not including it here as it is in the very early stages of development. I have not prioritised it yet, as a final decision hasn't been made regarding the Toga etc. switch to MkDocs.
PR Checklist: