Finish LLM text exporter #1417

Merged
merged 9 commits into from
Jun 24, 2025

Conversation

Mpdreamz
Member

@Mpdreamz Mpdreamz commented Jun 23, 2025

  • Reorganize Elastic.Markdown so that code is grouped by purpose, not by type

I thought this reorganization was a prerequisite for the LLM text exporter, which was going to use Markdig's round-trip serializer as the base for our own. However, that serializer relies heavily on parsing with trivia tracking, which we cannot do because it breaks list continuations. See #435.

We now manually parse includes and re-evaluate substitutions on the full included files. Running this exporter alongside the HTML exporter adds little overhead.
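To illustrate the idea of inlining includes and re-evaluating substitutions on each included file, here is a minimal Python sketch. It is not the code from this PR (which lives in the C# Elastic.Markdown project). The `:::{include} path` directive form, the `{{name}}` substitution form, and the names `render_llm_markdown` and `apply_substitutions` are assumptions made for illustration only.

```python
import re
from pathlib import Path

# Assumed syntax, for illustration only:
#   :::{include} relative/path.md
#   :::
# and {{name}} for substitutions.
INCLUDE_RE = re.compile(
    r"^:::\{include\}[ \t]*(?P<path>\S+)[ \t]*\n:::[ \t]*$", re.MULTILINE
)
SUBSTITUTION_RE = re.compile(r"\{\{\s*(?P<key>[\w.-]+)\s*\}\}")


def apply_substitutions(text, substitutions):
    """Replace {{key}} with its value; unknown keys are left untouched."""

    def replace(match):
        return substitutions.get(match.group("key"), match.group(0))

    return SUBSTITUTION_RE.sub(replace, text)


def render_llm_markdown(text, base_dir, substitutions, depth=0):
    """Resolve substitutions in `text`, then inline every include.

    Each included file goes through the same function, so its own
    substitutions and nested includes are resolved as well.
    """
    if depth > 10:  # hypothetical guard against circular includes
        raise RecursionError("include nesting too deep")

    text = apply_substitutions(text, substitutions)

    def inline(match):
        included = (Path(base_dir) / match.group("path")).read_text(encoding="utf-8")
        rendered = render_llm_markdown(included, base_dir, substitutions, depth + 1)
        return rendered.rstrip("\n")

    return INCLUDE_RE.sub(inline, text)
```

Because the output is produced from the source text rather than from a parsed AST, this kind of approach never has to round-trip list structure through the parser.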

This emits a filename.md next to each filename/index.html.

In addition, it emits an llm.zip that can be used to download everything at once.
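The output layout described above (a filename.md beside each filename/index.html, plus one llm.zip) could be sketched roughly as follows. This is an illustrative Python sketch, not the exporter from this PR. The function names, the `pages` mapping, and the choice of `index.md` for the site root are assumptions.

```python
import zipfile
from pathlib import Path, PurePosixPath


def markdown_path_for(html_path):
    """Map 'guide/setup/index.html' to 'guide/setup.md'."""
    path = PurePosixPath(html_path)
    if path.name != "index.html":
        raise ValueError(f"expected a .../index.html path, got {html_path!r}")
    folder = path.parent
    if folder == PurePosixPath("."):
        # Assumption: the root page has no folder name, so use index.md.
        return PurePosixPath("index.md")
    return folder.with_name(folder.name + ".md")


def write_llm_outputs(output_dir, pages):
    """Write one .md per page and bundle them all into llm.zip.

    `pages` maps an HTML output path (relative to output_dir) to the
    Markdown text that was rendered for that page.
    """
    output_dir = Path(output_dir)
    written = []
    for html_path, markdown in pages.items():
        relative = markdown_path_for(html_path)
        target = output_dir.joinpath(*relative.parts)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(markdown, encoding="utf-8")
        written.append(relative)

    archive = output_dir / "llm.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        for relative in written:
            bundle.write(output_dir.joinpath(*relative.parts), arcname=relative.as_posix())
    return archive
```

Appending `.md` to the folder name, rather than swapping a suffix, keeps folder names that contain dots (for example a version number) intact.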

@Mpdreamz Mpdreamz requested a review from a team as a code owner June 23, 2025 12:45
@Mpdreamz Mpdreamz self-assigned this Jun 23, 2025
@Mpdreamz Mpdreamz changed the title from "feature/llm text output" to "Finish LLM text exporter" Jun 23, 2025
@theletterf
Contributor

Niceeee! How does this work? Do we generate the final Markdown from an intermediate representation/AST? It's important that the final file we produce has the same content as the rendered HTML, that is, resolved substitutions, etc.

@Mpdreamz
Member Author

> Do we generate the final Markdown from an intermediate representation/AST?

Sadly, we do not. That was the initial plan, but it would be too time-consuming to implement due to a quirk in our parser's handling of TrackTrivia and loose list continuations.

> It's important that the final file we produce has the same content as the rendered HTML, that is, resolved substitutions, etc.

In the end we do get the same content. We might need to go over this again when we implement more dynamic {applies_to} output.

@theletterf
Contributor

I guess converting from the final HTML to Markdown would be too primitive / slow? I used that approach in the past on a Next.js project to generate the llms.txt file and it wasn't too bad.

@Mpdreamz
Member Author

> I guess converting from the final HTML to Markdown would be too primitive / slow?

Yeah, potentially. It's also more labor-intensive to project everything back with proper indentation, etc. I would worry too much about lists of lists and similar nested structures.

I'm not closing the door on doing that, but what we have now is good enough.
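To make the indentation concern concrete, here is a small Python sketch of the HTML-to-Markdown direction for nested lists only. It is not part of this PR and not how the exporter works. The class and function names are hypothetical, and it deliberately handles only plain-text list items.

```python
from html.parser import HTMLParser
import re


class ListToMarkdown(HTMLParser):
    """Convert nested <ul>/<ol> HTML into indented Markdown list items."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.stack = []  # one entry per currently open list

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.stack.append({"ordered": tag == "ol", "count": 0, "marker": ""})
        elif tag == "li" and self.stack:
            current = self.stack[-1]
            current["count"] += 1
            current["marker"] = f"{current['count']}." if current["ordered"] else "-"
            # A nested item is indented by the width of every ancestor
            # marker plus its trailing space ("- " is 2, "1. " is 3).
            indent = sum(len(level["marker"]) + 1 for level in self.stack[:-1])
            self.lines.append(" " * indent + current["marker"] + " ")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.lines):
            return  # text outside any list is ignored in this sketch
        text = re.sub(r"\s+", " ", data)
        if self.lines[-1].endswith(" "):
            text = text.lstrip()
        self.lines[-1] += text


def html_lists_to_markdown(html):
    parser = ListToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n".join(line.rstrip() for line in parser.lines)
```

Even this narrow case needs per-level marker widths to get the indentation right. Paragraphs that continue a list item after a nested list, code blocks inside items, and loose lists are not handled here, which is the kind of extra work the comment above refers to.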

@Mpdreamz Mpdreamz enabled auto-merge (squash) June 24, 2025 13:21
@Mpdreamz Mpdreamz merged commit cb72d5e into main Jun 24, 2025
16 checks passed
@Mpdreamz Mpdreamz deleted the feature/llm-text-output branch June 24, 2025 13:28