Feature: Generate documentation in LLM-friendly Markdown #6555
base: master
Conversation
This enables LLM-friendly documentation for entire sections, allowing users to copy complete documentation sections with a single click. Lambda@Edge now generates .md files on-demand with:

- Evaluated Hugo shortcodes
- Proper YAML frontmatter with product metadata
- Clean markdown without UI elements
- Section aggregation (parent + children in single file)

The llms.txt files are now generated automatically during build from content structure and product metadata in data/products.yml, eliminating the need for hardcoded files and ensuring maintainability.

**Testing**:
- Automated markdown generation in test setup via cy.exec()
- Implement dynamic content validation that extracts HTML content and verifies it appears in the markdown version

**Documentation**: Documents LLM-friendly markdown generation

**Details**: Add gzip decompression for S3 HTML files in the Lambda markdown generator

HTML files stored in S3 are gzip-compressed, but the Lambda was attempting to parse compressed data as UTF-8, causing JSDOM to fail to find article elements. This resulted in 404 errors for .md and .section.md requests.

- Add zlib gunzip decompression in s3-utils.js fetchHtmlFromS3()
- Detect gzip via the ContentEncoding header or magic bytes (0x1f 0x8b)
- Add configurable DEBUG constant for verbose logging
- Add debug logging for buffer sizes and decompression in both files

The decompression adds ~1-5ms per request but is necessary to parse HTML correctly. CloudFront caching minimizes Lambda invocations.

Await async markdown conversion functions: the convertToMarkdown and convertSectionToMarkdown functions are async but weren't being awaited, causing the Lambda to return a Promise object instead of a string. This resulted in CloudFront validation errors: "The body is not a string, is not an object, or exceeds the maximum size"

**Troubleshooting**:
- Set DEBUG for troubleshooting in the Lambda
Implements static Markdown generation during the Hugo build.

**Key Features:**
- Two-phase generation: HTML→MD (memory-bounded), MD→sections (fast)
- Automatic redirect detection via file size check (skips Hugo aliases)
- Product detection using compiled TypeScript product-mappings module
- Token estimation for LLM context planning (4 chars/token heuristic)
- YAML serialization with description sanitization

**Performance:**
- ~105 seconds for 5,000 pages + 500 sections
- ~300MB peak memory (safe for 2GB CircleCI environment)
- 23 files/sec conversion rate with controlled concurrency

**Configuration Parameters:**
- MIN_HTML_SIZE_BYTES (default: 1024) - Skip files below threshold
- CHARS_PER_TOKEN (default: 4) - Token estimation ratio
- Concurrency: 10 workers (CI), 20 workers (local)

**Output:**
- Single pages: public/*/index.md (with frontmatter + content)
- Section bundles: public/*/index.section.md (aggregated child pages)

**Files Changed:**
- scripts/build-llm-markdown.js (new) - Main build script
- scripts/lib/markdown-converter.cjs (renamed from .js) - Core conversion
- scripts/html-to-markdown.js - Updated import path
- package.json - Updated exports for .cjs module

Related: Replaces Lambda@Edge on-demand generation (5s response time) with build-time static generation for production deployment.

feat(deploy): Add staging deployment workflow and update CI

Integrates LLM markdown generation into deployment workflows with a complete staging deployment solution.
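The two heuristics named above (4 chars/token estimation and size-based redirect detection) are simple enough to sketch standalone. Constants mirror the defaults listed in the commit message; the helper names are illustrative:

```javascript
// Heuristics from the build script's configuration parameters
// (defaults per the commit message; helper names are illustrative).
const CHARS_PER_TOKEN = 4; // rough heuristic: 4 characters ≈ 1 token
const MIN_HTML_SIZE_BYTES = 1024; // below this, assume a Hugo alias/redirect stub

// Estimate LLM token count for context planning
function estimateTokens(markdown) {
  return Math.ceil(markdown.length / CHARS_PER_TOKEN);
}

// Skip tiny HTML files: they are almost certainly redirect stubs,
// not real content pages
function isLikelyRedirectStub(htmlSizeBytes) {
  return htmlSizeBytes < MIN_HTML_SIZE_BYTES;
}

console.log(estimateTokens('x'.repeat(4000))); // 1000
console.log(isLikelyRedirectStub(300)); // true
```

The size check is what lets the build skip Hugo alias pages without parsing them.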
**CircleCI Updates:**
- Switch from legacy html-to-markdown.js to optimized build:md
- 2x performance improvement (105s vs 200s+ for 5000 pages)
- Better memory management (300MB vs variable)
- Enables section bundle generation (index.section.md files)

**Staging Deployment:**
- New scripts/deploy-staging.sh for local staging deploys
- Complete workflow: Hugo build → markdown gen → S3 upload
- Environment variable driven configuration
- Optional step skipping for faster iteration
- CloudFront cache invalidation support

**NPM Scripts:**
- Added deploy:staging command for convenience
- Wraps deploy-staging.sh script

**Documentation:**
- Updated DOCS-DEPLOYING.md with comprehensive guide
- Merged staging/production workflows with Lambda@Edge docs
- Build-time generation now primary, Lambda@Edge fallback
- Troubleshooting section with common issues
- Environment variable reference
- Performance metrics and optimization tips

**Benefits:**
- Manual staging validation before production
- Consistent markdown generation across environments
- Faster CI builds with optimized script
- Better error handling and progress reporting
- Section aggregation for improved LLM context

**Usage:**
```bash
export STAGING_BUCKET="test2.docs.influxdata.com"
export AWS_REGION="us-east-1"
export STAGING_CF_DISTRIBUTION_ID="E1XXXXXXXXXX"
yarn deploy:staging
```

Related: Completes build-time markdown generation implementation

refactor: Remove Lambda@Edge implementation

Build-time markdown generation has replaced Lambda@Edge on-demand generation as the primary method. Removed Lambda code and updated documentation to focus on build-time generation and testing.

Removed:
- deploy/llm-markdown/ directory (Lambda@Edge code)
- Lambda@Edge section from DOCS-DEPLOYING.md

Added:
- Testing and Validation section in DOCS-DEPLOYING.md
- Focus on build-time generation workflow
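Part of the build:md speedup comes from bounding the number of in-flight conversions (the dependency list includes p-limit for this). A hand-rolled sketch of that bounded-concurrency pattern, for illustration only:

```javascript
// Minimal p-limit-style concurrency limiter (illustration only;
// the PR uses the p-limit package). Caps how many async tasks
// run at once, queueing the rest.
function createLimiter(maxConcurrent) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    task()
      .then(resolve, reject)
      .finally(() => {
        active--;
        next();
      });
  };
  return (task) =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
}

// Example: run 5 fake "conversions" with at most 2 in flight.
const limit = createLimiter(2);
let inFlight = 0;
let peak = 0;
const fakeConvert = () =>
  new Promise((resolve) => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    setTimeout(() => {
      inFlight--;
      resolve();
    }, 10);
  });

Promise.all(Array.from({ length: 5 }, () => limit(fakeConvert))).then(() => {
  console.log('peak concurrency:', peak); // never exceeds 2
});
```

Capping concurrency is what keeps peak memory bounded (~300MB) while still saturating CPU during HTML→MD conversion.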
Implements core markdown-converter.cjs functions in Rust for performance comparison.

Performance results:
- Rust: ~257 files/sec (10× faster)
- JavaScript: ~25 files/sec average

Recommendation: Keep JavaScript for now, implement incremental builds first. The Rust migration provides a 10× speedup but requires 3-4 weeks of integration effort.

Files:
- Cargo.toml: Rust dependencies (html2md, scraper, serde_yaml, clap)
- src/main.rs: Core conversion logic + CLI benchmark tool
- benchmark-comparison.js: Side-by-side performance testing
- README.md: Comprehensive findings and recommendations
- Ensure dropdown stays within viewport bounds (min 8px padding)
- Reposition dropdown on window resize and scroll events
- Clean up event listeners when dropdown closes
Add remark-parse, remark-frontmatter, remark-gfm, and unified for enhanced markdown processing capabilities.
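For illustration, the delimiter-based frontmatter shape that remark-frontmatter targets can be split with a stdlib-only sketch. This is not the PR's actual pipeline (which uses the remark plugins above); the function is hypothetical:

```javascript
// Minimal frontmatter splitter (illustration only; the PR uses
// remark-frontmatter for real parsing). Splits a markdown document
// into its YAML frontmatter block and body.
function splitFrontmatter(markdown) {
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!match) return { frontmatter: null, body: markdown };
  return { frontmatter: match[1], body: match[2] };
}

const doc = '---\ntitle: Write data\nproduct: influxdb3/core\n---\n# Write data\n';
console.log(splitFrontmatter(doc).frontmatter);
// title: Write data
// product: influxdb3/core
```

A real pipeline should prefer the remark toolchain, which handles edge cases (CRLF, nested `---` in content) that this regex does not.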
…tensions

Without the return statement, the Lambda@Edge function would continue executing after the callback, eventually hitting the trailing-slash redirect logic. This caused .md files to redirect to URLs with trailing slashes, which returned 404 from S3.
- Add URL_PATTERN_MAP and PRODUCT_NAME_MAP constants directly in the CommonJS module (ESM product-mappings.js cannot be require()'d)
- Update generateFrontmatter() to accept a baseUrl parameter and construct full URLs for the frontmatter url field
- Update generateSectionFrontmatter() similarly for section pages
- Update all call sites to pass the baseUrl parameter

This fixes empty product fields and relative URLs in generated markdown frontmatter when served via Lambda@Edge.
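Constructing the full frontmatter `url` from a base URL plus page path can be sketched as follows. The function name and signature are hypothetical, not the PR's actual `generateFrontmatter()` code:

```javascript
// Hypothetical sketch of joining a base URL and page path without
// double slashes, as the updated frontmatter generation does per
// the description above.
function buildPageUrl(baseUrl, pagePath) {
  return baseUrl.replace(/\/$/, '') + '/' + pagePath.replace(/^\//, '');
}

console.log(buildPageUrl('https://docs.influxdata.com/', '/influxdb3/core/write-data/'));
// https://docs.influxdata.com/influxdb3/core/write-data/
```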
Add -e, --env flag to html-to-markdown.js to control the base URL in generated markdown frontmatter. This matches Hugo's -e flag behavior and allows generating markdown with staging or production URLs. Also update build-llm-markdown.js with similar environment support.
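A minimal sketch of mapping the `-e`/`--env` flag to a base URL, mirroring Hugo's environment behavior. The mapping object and function are hypothetical; the hostnames come from the staging examples elsewhere in this PR:

```javascript
// Hypothetical env-to-base-URL mapping (the real script may differ).
// Hostnames taken from the PR's staging deployment examples.
const ENV_BASE_URLS = {
  production: 'https://docs.influxdata.com',
  staging: 'https://test2.docs.influxdata.com',
};

function baseUrlForEnv(env) {
  // Fall back to production for unknown environments
  return ENV_BASE_URLS[env] || ENV_BASE_URLS.production;
}

console.log(baseUrlForEnv('staging')); // https://test2.docs.influxdata.com
```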
- Add Rust-based HTML-to-Markdown converter with NAPI-RS bindings - Update Cypress markdown validation tests - Update deploy-staging.sh with force upload flag
- Defaults STAGING_URL to https://test2.docs.influxdata.com if not set
- Exports it so yarn build:md -e staging can use it
- Displays it in the summary
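The default-if-unset behavior described above is a standard shell parameter-expansion pattern. A sketch (the actual deploy-staging.sh may differ in detail):

```shell
# Default STAGING_URL if not already set, then export it so child
# processes (e.g. yarn build:md -e staging) can use it.
: "${STAGING_URL:=https://test2.docs.influxdata.com}"
export STAGING_URL
echo "Staging URL: ${STAGING_URL}"
```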
Copy section output for https://test2.docs.influxdata.com/influxdb3/core/write-data/

Replace the following:
See how to Configure Telegraf to write to InfluxDB 3 Core.

Use Telegraf with InfluxDB
Configure Telegraf to write to InfluxDB 3 Core. Update existing or create new Telegraf configurations to use the …

Use Telegraf to dual write to InfluxDB
Configure Telegraf to write data to multiple InfluxDB instances or clusters simultaneously.

Use Telegraf to write CSV data
Use the Telegraf …

Data Collection with Telegraf
Learn how to use Telegraf to make time series data collection easy in this free InfluxDB University course.

Troubleshoot issues writing data
Learn how to avoid unexpected results and recover from errors when writing to …

Handle write responses
InfluxDB 3 Core does the following when you send a write request:

The response body contains error details about rejected points, up to 100 points. Writes are synchronous: the response status indicates the final status of the … To ensure that InfluxDB handles writes in the order you request them, …

Review HTTP status codes
InfluxDB 3 Core uses conventional HTTP status codes to indicate the success …

If your data did not write to the database, see how to troubleshoot rejected points.

Troubleshoot failures
If you notice data is missing in your database, do the following:

Troubleshoot rejected points
InfluxDB rejects points that don't match the schema of existing data. Check for field data type differences between the rejected data point and points within the same …

Troubleshoot write performance issues
If you experience slow write performance or timeouts during high-volume ingestion, …

Memory configuration
InfluxDB 3 Core uses memory for both query processing and internal data operations.

Symptoms of memory-related write issues:

Solutions:
Example configuration for write-heavy workloads

```bash
influxdb3 serve \
  --exec-mem-pool-bytes PERCENTAGE \
  --gen1-duration 15m \
  # ... other options
```

Replace …

Related
Use the influxdb3 CLI to write data

Note: Use the API for batching and higher-volume writes

Construct line protocol
With a basic understanding of line protocol,
The following line protocol represents the schema described above. For this tutorial, you can either pass this line protocol directly to the …

Write the line protocol to InfluxDB
Use the …

Note: By default, InfluxDB 3 Core uses the timestamp magnitude to auto-detect the precision.

```bash
influxdb3 write \
  --database DATABASE_NAME \
  --token AUTH_TOKEN \
  'home,room=Living\ Room temp=21.1,hum=35.9,co=0i 1641024000
home,room=Kitchen temp=21.0,hum=35.9,co=0i 1641024000
home,room=Living\ Room temp=21.4,hum=35.9,co=0i 1641027600
home,room=Kitchen temp=23.0,hum=36.2,co=0i 1641027600
home,room=Living\ Room temp=21.8,hum=36.0,co=0i 1641031200
home,room=Kitchen temp=22.7,hum=36.1,co=0i 1641031200
home,room=Living\ Room temp=22.2,hum=36.0,co=0i 1641034800
home,room=Kitchen temp=22.4,hum=36.0,co=0i 1641034800
home,room=Living\ Room temp=22.2,hum=35.9,co=0i 1641038400
home,room=Kitchen temp=22.5,hum=36.0,co=0i 1641038400
home,room=Living\ Room temp=22.4,hum=36.0,co=0i 1641042000
home,room=Kitchen temp=22.8,hum=36.5,co=1i 1641042000'
```
Replace the following:
Related
Use the InfluxDB HTTP API to write data
Use the InfluxDB HTTP API to write data to InfluxDB 3 Core.

Tip: Choose the write endpoint for your workload
When creating new write workloads, use the InfluxDB HTTP API …
When bringing existing v1 write workloads, use the InfluxDB 3 Core …
When bringing existing v2 write workloads, use the InfluxDB 3 Core …
For Telegraf, use the InfluxDB v1.x …

Use the v3 write_lp API to write data

Use compatibility APIs and client libraries to write data
Use HTTP API endpoints compatible with InfluxDB v2 and v1 clients to write points as line protocol data to InfluxDB 3 Core.

Related
Use InfluxDB client libraries to write data
Use InfluxDB 3 client libraries that integrate with your code to construct data …

Set up your project
Set up your InfluxDB 3 Core project and credentials.

After setting up InfluxDB 3 Core and your project, you should have the following:

Initialize a project directory
Create a project directory and initialize it for your programming language.

Install the client library
Install the InfluxDB 3 client library for your programming language of choice.

C#: Add the InfluxDB 3 C# client library to your project:

```bash
dotnet add package InfluxDB3.Client
```

Go: Add the InfluxDB 3 Go client library to your project:

```bash
go mod init path/to/project/dir && cd $_
go get github.com/InfluxCommunity/influxdb3-go/v2/influxdb3
```

Java: Add the InfluxDB 3 Java client library to your project dependencies. For example, to add the library to a Maven project, add the following dependency:

```xml
<dependency>
  <groupId>com.influxdb</groupId>
  <artifactId>influxdb3-java</artifactId>
  <version>1.1.0</version>
</dependency>
```

To add the library to a Gradle project, add the following to your dependencies:

```groovy
dependencies {
  implementation 'com.influxdb:influxdb3-java:1.1.0'
}
```

Node.js: For a Node.js project, use npm:

```bash
npm install --save @influxdata/influxdb3-client
```

Python: Install the InfluxDB 3 Python client library using pip:

```bash
pip install influxdb3-python pandas
```

Construct line protocol
With a basic understanding of line protocol, use client library write methods to provide data as raw line protocol. Client libraries provide one or more … Examples in this guide show how to construct …

Example home schema
Consider a use case where you collect data from sensors in your home. To collect this data, use the following schema:
Go

The sample code does the following:

Related
Best practices for writing data
The following articles walk through recommendations and best practices for …

Optimize writes to InfluxDB 3 Core
Tips and examples to optimize performance and system overhead when writing data to InfluxDB 3 Core.

InfluxDB schema design recommendations
Design your schema for simpler and more performant queries.
Copy page output for https://test2.docs.influxdata.com/influxdb3/core/get-started/setup/

When you run …, the system displays warning messages showing the auto-generated identifiers:

Important: When to use quick-start mode
Quick-start mode is designed for development, testing, and home lab environments. For production deployments, use explicit configuration values with the …

Configuration precedence: Environment variables override auto-generated defaults.

Start InfluxDB
Use the …

Note: Diskless architecture
InfluxDB 3 supports a diskless architecture that can operate with object … For this getting started guide, use the …

```bash
# File system object store
# Provide the file system directory
influxdb3 serve \
  --node-id host01 \
  --object-store file \
  --data-dir ~/.influxdb3
```

Object store examples

File system object store
Store data in a specified directory on the local filesystem. Replace the following with your values:

```bash
# File system object store
# Provide the file system directory
influxdb3 serve \
  --node-id host01 \
  --object-store file \
  --data-dir ~/.influxdb3
```

Docker with a mounted file system object store
To run the Docker image and persist …
Note: The InfluxDB 3 Core Docker image exposes port …

Docker compose with a mounted file system object store
Open compose.yaml:

```yaml
# compose.yaml
services:
  influxdb3-core:
    image: influxdb:3-core
    ports:
      - 8181:8181
    command:
      - influxdb3
      - serve
      - --node-id=node0
      - --object-store=file
      - --data-dir=/var/lib/influxdb3/data
      - --plugin-dir=/var/lib/influxdb3/plugins
    volumes:
      - type: bind
        # Path to store data on your host system
        source: ~/.influxdb3/data
        # Path to store data in the container
        target: /var/lib/influxdb3/data
      - type: bind
        # Path to store plugins on your host system
        source: ~/.influxdb3/plugins
        # Path to store plugins in the container
        target: /var/lib/influxdb3/plugins
```

Use the Docker Compose CLI to start the server. For example:

```bash
docker compose pull && docker compose up influxdb3-core
```

The command pulls the latest InfluxDB 3 Core Docker image and starts …

Tip: Custom port mapping
To customize your … For more information about mapping your container port to a specific host port, see the …

S3 object storage
Store data in an S3-compatible object store.

```bash
# S3 object store (default is the us-east-1 region)
# Specify the object store type and associated options
influxdb3 serve \
  --node-id host01 \
  --object-store s3 \
  --bucket OBJECT_STORE_BUCKET \
  --aws-access-key AWS_ACCESS_KEY_ID \
  --aws-secret-access-key AWS_SECRET_ACCESS_KEY
```

```bash
# Minio or other open source object store
# (using the AWS S3 API with additional parameters)
# Specify the object store type and associated options
influxdb3 serve \
  --node-id host01 \
  --object-store s3 \
  --bucket OBJECT_STORE_BUCKET \
  --aws-access-key-id AWS_ACCESS_KEY_ID \
  --aws-secret-access-key AWS_SECRET_ACCESS_KEY \
  --aws-endpoint ENDPOINT \
  --aws-allow-http
```

Memory-based object store
Store data in RAM without persisting it on shutdown.

```bash
# Memory object store
# Stores data in RAM; doesn't persist data
influxdb3 serve \
  --node-id host01 \
  --object-store memory
```

For more information about server options, use the CLI help or view the InfluxDB 3 CLI reference:

```bash
influxdb3 serve --help
```

Tip: Use the InfluxDB 3 Explorer query interface
You can complete the remaining steps in this guide using InfluxDB 3 Explorer. For more information, see the InfluxDB 3 Explorer documentation.

Set up authorization
InfluxDB 3 Core uses token-based authorization to authorize actions in the … InfluxDB 3 Core supports admin tokens, which grant access to all CLI actions and API endpoints. For more information about tokens and authorization, see Manage tokens.

Create an operator token
After you start the server, create your first admin token using the CLI:

```bash
influxdb3 create token --admin
```

```bash
# With Docker — in a new terminal:
docker exec -it CONTAINER_NAME influxdb3 create token --admin
```

Replace … The command returns a token string for authenticating CLI commands and API requests.

Important: Store your token securely
InfluxDB displays the token string only when you create it.

Set your token for authorization
Use your operator token to authenticate server actions in InfluxDB 3 Core. Use one of the following methods to provide your token and authenticate. In your command, replace …

Environment variable (recommended)
Set the environment variable:

```bash
export INFLUXDB3_AUTH_TOKEN=YOUR_AUTH_TOKEN
```

Include the token in CLI commands:

```bash
influxdb3 show databases --token YOUR_AUTH_TOKEN
```

For HTTP API requests, include your token in the Authorization header:

```bash
curl "http://localhost:8181/api/v3/configure/database" \
  --header "Authorization: Bearer YOUR_AUTH_TOKEN"
```

Learn more about tokens and permissions
Related

Deployed to test2
Pull request overview
This PR implements LLM-friendly Markdown generation for the InfluxData documentation site. It introduces a dual-implementation approach with a Rust converter (10x faster) and JavaScript fallback, along with UI components for copying/downloading documentation in various formats.
Key Changes:
- Rust-based HTML-to-Markdown converter with Node.js bindings via napi-rs
- JavaScript fallback using Turndown and JSDOM for broader compatibility
- Format selector UI component for accessing documentation in different formats
- Build scripts for generating Markdown at build time
- Hugo templates for llms.txt generation following llmstxt.org specification
Reviewed changes
Copilot reviewed 39 out of 42 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| package.json | Added dependencies (turndown, jsdom, remark family, p-limit) and new build scripts |
| yarn.lock | Lockfile updates for new dependencies including Rust napi-rs tooling |
| scripts/rust-markdown-converter/* | Rust implementation with Cargo config, build script, and library code |
| scripts/lib/markdown-converter.cjs | JavaScript fallback implementation with Turndown/JSDOM |
| scripts/html-to-markdown.js | CLI tool for HTML→Markdown conversion |
| scripts/build-llm-markdown.js | Optimized build script with two-phase conversion |
| scripts/deploy-staging.sh | Staging deployment script |
| layouts/partials/article/format-selector.html | UI component for format selection |
| layouts/index.llmstxt.txt | Root llms.txt template |
| layouts/_default/landing-influxdb.llmstxt.txt | Landing page llms.txt template |
Code context:

```js
  return;
}

if (productMappings && productMappings.initializeProductData) {
```
Copilot AI (Nov 30, 2025):
This guard always evaluates to false.
Code context:

```js
validateFrontmatter,
validateTable,
containsText,
```
Unused imports containsText, validateFrontmatter.
Suggested change (keep only the used import):

```js
validateTable,
```
Code context:

```js
 *
 * @default 4 - Rough heuristic (4 characters ≈ 1 token)
 */
const CHARS_PER_TOKEN = 4;
```
Unused variable CHARS_PER_TOKEN.
Code context:

```js
const TurndownService = require('turndown');
const { JSDOM } = require('jsdom');
const path = require('path');
```
Unused variable path.
Code context:

```js
const TurndownService = require('turndown');
const { JSDOM } = require('jsdom');
const path = require('path');
const fs = require('fs');
```
Unused variable fs.
Changes

Content discovery
- `/llms.txt`, following the https://llmstxt.org/ pattern for content discovery

UI
- Markdown formatted content: `index.section.md` and `index.md`
- `url` values are full URLs. Hostname is determined by the environment that you pass to the build script, e.g. `yarn build:md -e staging` uses the `test2` hostname.

Build and deploy
- `yarn build:md` sets the hostname to use and generates the Markdown from `/public` HTML
- `yarn deploy:staging` runs `build:md -e staging` and `s3deploy`
- `deploy/edge.js` lambda function to return URLs as-is, instead of appending a trailing slash, if they end in a valid file extension.

Tests

Improvements
- `product-mappings.ts` shared utility for DRY access to `data/products.yml`

Production deployment steps
- Update `deploy/edge.js` in this PR, publish a new version, update the behavior with the new version
- `yarn build:md`
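The edge-function change described above (return URLs as-is when they end in a valid file extension, instead of appending a trailing slash) can be sketched as follows. The extension list and function name are illustrative, not the PR's actual deploy/edge.js code:

```javascript
// Sketch of the "skip trailing-slash redirect for file extensions"
// check. Extension list is illustrative; the real edge.js may differ.
const FILE_EXTENSION_PATTERN = /\.(md|html|txt|json|xml|css|js|png|jpg|svg)$/;

function needsTrailingSlashRedirect(uri) {
  // URIs ending in a known file extension are returned as-is,
  // so .md requests reach S3 without a 404-causing redirect.
  if (FILE_EXTENSION_PATTERN.test(uri)) return false;
  return !uri.endsWith('/');
}

console.log(needsTrailingSlashRedirect('/influxdb3/core/write-data/index.md')); // false
console.log(needsTrailingSlashRedirect('/influxdb3/core/write-data')); // true
```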