A Python tool to generate high-quality Dash docsets for Kubernetes documentation with enhanced semantic indexing.
- Enhanced Indexing: Extracts ~8,000+ searchable entries, including:
  - API Types (PodSpec, DeploymentStatus, Container, etc.)
  - Sections (Volumes, Scheduling, Security Context, etc.)
  - Guides (concepts, tasks, tutorials)
  - Glossary terms
  - API Resources
  - kubectl Commands
  - Components
- Sitemap-Based Scraping: Uses kubernetes.io/en/sitemap.xml to download ~900 valid doc pages with zero 404 errors
- Multiple Input Sources:
  - Scrape directly from kubernetes.io (--scrape)
  - Use an existing Dash-generated docset (--source)
- Validation: Built-in docset validation script and pre-commit hooks
- Intelligent Parsing: Specialized parsers for different documentation types:
  - API reference pages
  - kubectl command documentation
  - Conceptual guides
  - Glossary entries
  - Code samples
  - Embedded Dash anchors

- Python >= 3.14 (or adjust in pyproject.toml)
- uv package manager

git clone https://github.com/ichoosetoaccept/kubernetes-docset.git
cd kubernetes-docset

uv run python main.py --scrape

This will:
- Fetch sitemap and download ~900 documentation pages from kubernetes.io
- Download CSS, JS, and image assets
- Cache everything in .cache/kubernetes-docs/
- Generate a Dash docset in ./output/ with relative asset paths
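
The first step above, turning the sitemap into a list of pages to download, can be sketched as follows. This is a minimal illustration and not the code in scraper.py: the function name `extract_doc_urls` and the `https://kubernetes.io/docs/` prefix filter are assumptions, and the sample XML is made up. It relies only on the standard sitemaps.org `urlset` format.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_doc_urls(sitemap_xml: str,
                     prefix: str = "https://kubernetes.io/docs/") -> list[str]:
    """Return the <loc> URLs that start with `prefix`, in document order.

    The prefix filter is an assumption made for this sketch.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url.startswith(prefix):
            urls.append(url)
    return urls


# Made-up sample in the same shape as a real sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://kubernetes.io/docs/concepts/workloads/pods/</loc></url>
  <url><loc>https://kubernetes.io/blog/</loc></url>
  <url><loc>https://kubernetes.io/docs/reference/kubectl/</loc></url>
</urlset>"""

for url in extract_doc_urls(SAMPLE):
    print(url)
```

Because every URL comes from the sitemap rather than from following links, the scraper only requests pages the site itself lists, which is what makes a run without 404 errors possible.
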
uv run python main.py --source ~/Library/Application\ Support/Dash/Docset\ Generator/Kubernetes/Kubernetes.docset

This option uses a pre-generated docset from Dash's built-in scraper as the source, which may provide additional embedded anchors.

# Specify output directory
uv run python main.py --scrape --output ./my-docsets
# Specify Kubernetes version
uv run python main.py --scrape --version 1.34
# Use custom cache directory
uv run python main.py --scrape --cache-dir ./cache

After generating the docset:
- Open Dash
- Go to Preferences > Docsets
- Click the + button
- Select the generated .docset file
Or simply double-click the .docset file.
The tool uses multiple specialized parsers that analyze the HTML structure:
- DashAnchorParser: Extracts embedded dashAnchor entries from Dash-generated HTML
- EnhancedAPIReferenceParser: Extracts Types, Sections, and Properties from API reference HTML
- EnhancedKubectlParser: Extracts kubectl commands and subcommands
- CodeSampleParser: Identifies Kubernetes manifests in code blocks
- APIResourceParser: Identifies API resources (Pod, Deployment, Service, etc.)
- KubectlCommandParser: Extracts kubectl command documentation
- GuideParser: Indexes conceptual guides and tutorials
- GlossaryParser: Extracts glossary terms
- ComponentParser: Identifies Kubernetes components (kube-apiserver, kubelet, etc.)
- FallbackParser: Catches any unmatched documentation pages
All matching parsers run on each file, allowing multiple entry types per page. For example, an API reference page might contribute:
- A Resource entry for the API object
- Multiple Type entries for embedded types (PodSpec, Container, etc.)
- Multiple Section entries for field groups
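
The "all matching parsers run" behaviour can be illustrated with a small sketch. Everything named here (`Entry`, `ResourceParser`, `SectionParser`, `index_page`, the `matches`/`parse` methods, and the `/kubernetes-api/` path check) is hypothetical and chosen for illustration; the real classes in parsers.py and enhanced_parsers.py may be structured differently.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str  # text shown in Dash search results
    type: str  # Dash entry type, e.g. "Resource" or "Section"
    path: str  # page path, optionally with a #fragment


class ResourceParser:
    """Hypothetical parser: one Resource entry per API reference page."""

    def matches(self, path: str, html: str) -> bool:
        return "/kubernetes-api/" in path

    def parse(self, path: str, html: str) -> list[Entry]:
        name = path.rstrip("/").rsplit("/", 1)[-1]
        return [Entry(name, "Resource", path)]


class SectionParser:
    """Hypothetical parser: one Section entry per <h2 id="..."> heading."""

    HEADING = re.compile(r'<h2 id="([^"]+)">([^<]+)</h2>')

    def matches(self, path: str, html: str) -> bool:
        return "<h2" in html

    def parse(self, path: str, html: str) -> list[Entry]:
        return [Entry(title.strip(), "Section", f"{path}#{anchor}")
                for anchor, title in self.HEADING.findall(html)]


def index_page(path: str, html: str, parsers: list) -> list[Entry]:
    """Run every parser that matches the page, not just the first one."""
    entries: list[Entry] = []
    for parser in parsers:
        if parser.matches(path, html):
            entries.extend(parser.parse(path, html))
    return entries


PARSERS = [ResourceParser(), SectionParser()]
page = "docs/reference/kubernetes-api/workload-resources/pod-v1/"
html = '<h1>Pod</h1><h2 id="PodSpec">PodSpec</h2><h2 id="PodStatus">PodStatus</h2>'
for entry in index_page(page, html, PARSERS):
    print(entry.type, entry.name, entry.path)
```

Collecting from every matching parser, instead of stopping at the first match, is what lets a single page yield several entry types.
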
# Validate the generated docset
uv run python verify.py -v
# Run pre-commit validation hook
prek run --hook-stage manual validate-docset

.
├── main.py # CLI entry point
├── verify.py # Docset validation script
├── k8s_docset/
│ ├── __init__.py
│ ├── builder.py # Docset builder (path fixing, TOC injection)
│ ├── parsers.py # Core HTML parsers
│ ├── enhanced_parsers.py # Enhanced parsers for raw HTML
│ └── scraper.py # Sitemap-based web scraper
├── contrib/ # Dash contribution files
│ ├── docset.json
│ ├── icon.png
│ └── icon@2x.png
├── .pre-commit-config.yaml # Pre-commit hooks (ruff, validation)
└── pyproject.toml # Project dependencies
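
The checks that verify.py performs are not listed here. As a rough idea of what docset validation can involve, the sketch below inspects the standard Dash layout (Contents/Info.plist, Contents/Resources/docSet.dsidx with a searchIndex table, and Contents/Resources/Documents/). The function name `check_docset` and the choice of checks are assumptions, not a description of verify.py.

```python
import sqlite3
from pathlib import Path


def check_docset(docset: Path) -> list[str]:
    """Return a list of problems; an empty list means these basic checks passed."""
    problems: list[str] = []
    resources = docset / "Contents" / "Resources"

    if not (docset / "Contents" / "Info.plist").is_file():
        problems.append("missing Contents/Info.plist")

    index = resources / "docSet.dsidx"
    if not index.is_file():
        problems.append("missing Contents/Resources/docSet.dsidx")
        return problems

    conn = sqlite3.connect(index)
    try:
        rows = conn.execute("SELECT name, path FROM searchIndex").fetchall()
    except sqlite3.DatabaseError as exc:
        problems.append(f"unreadable searchIndex table: {exc}")
        return problems
    finally:
        conn.close()

    if not rows:
        problems.append("searchIndex is empty")

    # Every indexed path should resolve to a file under Documents/.
    # Simplification: only the #fragment is stripped before the lookup.
    docs = resources / "Documents"
    for name, path in rows:
        target = path.split("#", 1)[0]
        if not (docs / target).is_file():
            problems.append(f"entry {name!r} points at missing file {target}")
    return problems
```

A call such as `check_docset(Path("output/Kubernetes.docset"))` would return an empty list for a docset that passes; the path in that call is illustrative only.
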
MIT License - see LICENSE for details.