feat(docs): sync knowledge base from remote docs API #2
barckcode
left a comment
Very nice plumbing: remote → cache → local cascade, per-slug diff with selective re-embedding, shadow mode, atomic writes (tmp.replace(dst)), a health endpoint with docs_last_sync_ok, and a non-fatal fallback. Leaving the legacy branch intact is the right call.
Three blockers before merging, plus a few nice-to-have improvements.
Blockers
1. Send If-None-Match and handle 304
DocsClient.fetch_manifest/fetch_body always download the full body even when a local cache exists. Persisting the ETag alongside the cached body and sending it on the next fetch saves bandwidth and takes advantage of Cloudflare's shared cache. This requires the parallel fix in helmcode/nan#4 (honoring If-None-Match in the endpoint).
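A minimal sketch of the ETag round trip, not the PR's actual code. The `fetch` callable is a hypothetical stand-in for the real httpx request (something like `await client.get(url, headers=headers)`), and the `.body` / `.etag` file layout and the `conditional_fetch` name are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Mapping

# Hypothetical transport: takes request headers and returns
# (status_code, response_headers, body_text).
Fetch = Callable[[Mapping[str, str]], tuple[int, Mapping[str, str], str]]


def _atomic_write(dst: Path, text: str) -> None:
    """Write to a sibling temp file, then rename over the destination."""
    tmp = dst.with_name(dst.name + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    tmp.replace(dst)


def conditional_fetch(fetch: Fetch, cache_dir: Path, name: str) -> str:
    body_path = cache_dir / f"{name}.body"  # assumed cache layout
    etag_path = cache_dir / f"{name}.etag"  # assumed cache layout

    headers: dict[str, str] = {}
    # Only send the validator when the body it refers to is still on disk;
    # otherwise a 304 would leave us with nothing to return.
    if body_path.exists() and etag_path.exists():
        headers["If-None-Match"] = etag_path.read_text(encoding="utf-8")

    status, resp_headers, body = fetch(headers)
    if status == 304:
        return body_path.read_text(encoding="utf-8")
    if status != 200:
        raise RuntimeError(f"unexpected status {status} for {name}")

    _atomic_write(body_path, body)
    etag = resp_headers.get("ETag")
    if etag:
        _atomic_write(etag_path, etag)
    else:
        # No validator from the server: drop any stale one.
        etag_path.unlink(missing_ok=True)
    return body
```

On the first call the request goes out without `If-None-Match`; on the second, the stored ETag is sent and a 304 is answered from the cached body.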
2. The shadow diff is comparing different things
main.py:34-37 hashes the entire local .md, frontmatter included. The remote hashes the body without frontmatter (after htmlToText, which will also go away with the endpoint change in helmcode/nan#4). Result: the diff always marks everything as changed, so shadow mode loses its value as a gate.
Both sides need to hash the canonical text that goes into the chunker (same algorithm, same input). After the changes in helmcode/nan#4 this gets simpler, because both sides will be plain markdown from the same origin.
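One way to get "same algorithm, same input" on both sides, sketched under assumptions: the function names here are hypothetical, and the normalization rules (LF line endings, stripped outer whitespace, one trailing newline) are illustrative choices rather than anything the PR prescribes.

```python
import hashlib


def strip_frontmatter(text: str) -> str:
    """Drop a leading '---' ... '---' block; leave malformed input untouched."""
    if not text.startswith("---\n"):
        return text
    # Start at index 3 so an empty frontmatter block ("---\n---\n") also closes.
    end = text.find("\n---\n", 3)
    if end == -1:
        return text  # no closing delimiter: treat as plain text
    return text[end + len("\n---\n"):]


def canonical_text(raw: str, *, strip: bool) -> str:
    """Text that would be handed to the chunker (assumed normalization)."""
    text = raw.replace("\r\n", "\n")
    if strip:
        text = strip_frontmatter(text)
    return text.strip() + "\n"


def canonical_hash(raw: str, *, strip: bool) -> str:
    return hashlib.sha256(canonical_text(raw, strip=strip).encode("utf-8")).hexdigest()


local_md = "---\ntitle: X\n---\n# Body\n\ntext\n"   # local file, with frontmatter
remote_md = "# Body\r\n\r\ntext"                     # remote body, no frontmatter

# Same logical content now hashes the same on both sides.
same = canonical_hash(local_md, strip=True) == canonical_hash(remote_md, strip=False)
```

Hashing the raw local file instead (`strip=False`) produces a different digest from the remote one, which is the "everything is changed" noise described above.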
3. No tests
A PR that introduces a new HTTP client, a loader with hash-based diffing, a background refresh and a shadow mode must ship with tests. At a minimum:
- _strip_frontmatter: with frontmatter, without frontmatter, malformed frontmatter.
- _SAFE_SLUG_RE: positives (api, getting-started) and negatives (../etc/passwd, Foo, empty, >64 chars).
- DocsClient.fetch_manifest/fetch_body: mock httpx → valid manifest, manifest with an invalid entry (filtered out), body with a hash mismatch → warning but no abort, fallback to cache on 5xx.
- load_documentation_from_remote: diff (unchanged skip, changed re-embed, removed cleanup), remote → cache → local cascade.
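As an illustration of the slug cases listed above, here is a small sketch using the pattern quoted in the PR description (`^[a-z0-9][a-z0-9-]{0,63}$`). The `is_safe_slug` helper and the case lists are hypothetical, and whether the real code calls `match` or `fullmatch` is not known from this page.

```python
import re

# Pattern as quoted in the PR description; the helper around it is illustrative.
SAFE_SLUG_RE = re.compile(r"^[a-z0-9][a-z0-9-]{0,63}$")


def is_safe_slug(slug: str) -> bool:
    # fullmatch rather than match: with match, '$' also accepts a position
    # just before a trailing newline, so "api\n" would slip through.
    return SAFE_SLUG_RE.fullmatch(slug) is not None


POSITIVE = ["api", "getting-started", "a" * 64]
NEGATIVE = ["../etc/passwd", "Foo", "", "a" * 65, "-api", "api\n"]

accepted = [s for s in POSITIVE + NEGATIVE if is_safe_slug(s)]
```

In a pytest suite the two lists would map naturally onto a pair of `pytest.mark.parametrize` tests.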
Nice-to-have (non-blocking)
- Race window in refresh: _refresh_docs_once mutates store._chunks while on_message reads with store.search. Single-threaded asyncio saves it, but there are several awaits in the middle of the refresh. An asyncio.Lock per refresh makes the invariants clear.
- SQLite _conn: add check_same_thread=False in case something touches another thread later on.
- settings.docs_use_remote: str is not validated: DOCS_USE_REMOTE=Remote silently falls back to local. Use Literal["local","remote","shadow"] with pydantic.
- DocsClient has no retry/backoff: a transient 5xx = 15 min without sync. httpx.AsyncHTTPTransport(retries=3) covers it.
- Manifest.version could be used as a short-circuit (if it did not change, do not iterate entries), but that requires the version to be stable in helmcode/nan#4.
- Remove the double frontmatter serialization (the endpoint adds it, _strip_frontmatter strips it) once the change in helmcode/nan#4 lands.
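The asyncio.Lock suggestion can be sketched as follows. The `Refresher` class and its `events` log are hypothetical, and `asyncio.sleep(0)` stands in for the awaits (fetch, embed, save) that sit in the middle of a real refresh.

```python
import asyncio


class Refresher:
    """Hypothetical holder for the refresh lock; not the bot's real class."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.events: list[str] = []

    async def refresh_once(self, tag: str) -> None:
        # Without the lock, a second refresh could start while the first is
        # suspended at an await and interleave its mutations.
        async with self._lock:
            self.events.append(f"{tag}:start")
            await asyncio.sleep(0)  # stand-in for fetch/embed/save awaits
            self.events.append(f"{tag}:end")


async def main() -> list[str]:
    refresher = Refresher()
    await asyncio.gather(refresher.refresh_once("a"), refresher.refresh_once("b"))
    return refresher.events


events = asyncio.run(main())
# events == ["a:start", "a:end", "b:start", "b:end"]
```

Note that this only serializes refreshes against each other. Readers such as a search call stay safe only as long as they do not await in the middle of a read.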
Address PR #2 review blockers.
The bot now hashes/chunks/caches a single canonical text in local, remote and cache paths, sends conditional GETs, and short-circuits cleanly when the upstream manifest version is unchanged.
- knowledge: canonicalize_doc_text() is the single source of truth for the text that enters the chunker. load_documentation() applies it with strip_frontmatter=True; load_documentation_from_remote() with strip_frontmatter=False on the happy path. The hash recorded in doc_hashes is now the hash of canonical text, so local and remote agree byte-for-byte on the same logical content. Expect a one-shot reindex on first deploy as old hashes (raw file with frontmatter) rotate.
- SimpleVectorStore: new meta table with get_meta/set_meta, sqlite connection now uses check_same_thread=False.
- load_documentation_from_remote: persists manifest.version in meta and short-circuits when the remote returns the same version, avoiding per-entry walks when nothing changed upstream. Skipped when source_of_truth==cache so we still notice drift while remote is unreachable.
- docs_client: _EtagStore persists per-resource ETags atomically; fetch_manifest/fetch_body send If-None-Match and reuse cache on 304, refetching unconditionally if the cache went missing. AsyncHTTPTransport(retries=3) for connect-level retries, plus a bounded backoff on 5xx (0.5/1/2s). Body cache stores the canonical text, never the raw frontmatter; load_cached_body canonicalises on read so legacy caches keep working.
- main shadow mode: compares canonical hashes on both sides (local strip=True, remote strip=False) and fetches each body once so the diff is real signal, not frontmatter noise.
- config: docs_use_remote is Literal["local","remote","shadow"] so typos fail at startup instead of falling silently to local.
- base: _refresh_docs_once wrapped in an asyncio.Lock so overlapping refreshes can't interleave around the embed/save cycle.
- tests: conftest sets the required Settings env vars; new pytest suites cover _strip_frontmatter, _SAFE_SLUG_RE, ETag/304 round trips, hash-mismatch warnings, legacy cache normalisation, the canonicalize_doc_text contract, load_documentation idempotency, and the full load_documentation_from_remote matrix (unchanged skip, changed reindex, removed cleanup, remote->cache and remote->local fallbacks, manifest.version short-circuit, cache source not short-circuiting). pytest-httpx>=0.30.0 added to the dev extra.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
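A rough sketch of what a bounded backoff on 5xx with delays of 0.5/1/2s might look like. The `send` and `sleep` callables are hypothetical injection points (the real client would issue an httpx request), and reading the three delays as three retries after the first attempt is an assumption.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical shapes: send() performs one request and returns (status, body).
Send = Callable[[], Awaitable[tuple[int, str]]]
Sleep = Callable[[float], Awaitable[None]]

BACKOFF_DELAYS = (0.5, 1.0, 2.0)  # seconds, as listed in the commit message


async def get_with_backoff(send: Send, sleep: Sleep = asyncio.sleep) -> tuple[int, str]:
    """Retry on 5xx after each delay; give up once the delays are exhausted."""
    status, body = await send()
    for delay in BACKOFF_DELAYS:
        if status < 500:
            break
        await sleep(delay)
        status, body = await send()
    # The caller decides what a final 5xx means (e.g. fall back to cache).
    return status, body
```

Passing `sleep` as a parameter keeps tests fast: a fake that records the requested delays can replace `asyncio.sleep`.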
Add an optional remote sync path so the bot can index the docs
published by the website instead of the local snapshot under
bot/docs/knowledge/. Selected via DOCS_USE_REMOTE:
- local (default): keep current behavior, no remote calls.
- remote: load knowledge from the remote manifest/body API.
- shadow: load from local but also fetch the remote and log a
per-slug diff (local_only / remote_only / changed) — useful to
validate parity before flipping to remote.
What changes:
- New bot/docs_client.py — typed async client over httpx for
GET /api/docs/manifest.json and GET /api/docs/{slug}.md, with
on-disk cache under DOCS_CACHE_DIR (manifest + per-slug body)
for fallback when the remote is unreachable.
- New load_documentation_from_remote() in bot/knowledge.py — diffs
per-entry sha256, only re-chunks/embeds what changed, drops
sources that disappeared from the manifest, falls back to cache
and then to local docs if both fail.
- bot/base.py: background refresh task (DOCS_REFRESH_INTERVAL,
60s floor) that calls _refresh_docs_once, embeds new chunks and
reports docs_last_sync / docs_last_sync_ok on /health. The !docs
command now lists tracked sources from the vector store, not the
local filesystem.
- main.py: cold start branches on DOCS_USE_REMOTE; shadow mode also
exercises fetch_body to surface remote errors early.
- config: docs_base_url, docs_refresh_interval, docs_use_remote,
docs_cache_dir, docs_http_timeout.
- pyproject: add httpx>=0.27.
The legacy local loader (bot/docs/knowledge/*.md + load_documentation)
is intentionally left untouched so deploys can keep using local until
remote sync is validated in prod; cleanup is a follow-up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
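The per-entry sha256 diff described in this commit message can be sketched as a pure planning step. `SyncPlan` and `plan_sync` are hypothetical names, and both inputs are assumed to be plain slug-to-hash mappings; the real loader works against the vector store and the manifest objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    unchanged: list[str]  # skip: hash already indexed
    changed: list[str]    # re-chunk and re-embed (new or modified)
    removed: list[str]    # drop from the store: gone from the manifest


def plan_sync(stored_hashes: dict[str, str], manifest_hashes: dict[str, str]) -> SyncPlan:
    unchanged: list[str] = []
    changed: list[str] = []
    for slug, digest in sorted(manifest_hashes.items()):
        if stored_hashes.get(slug) == digest:
            unchanged.append(slug)
        else:
            changed.append(slug)
    removed = sorted(set(stored_hashes) - set(manifest_hashes))
    return SyncPlan(unchanged, changed, removed)


plan = plan_sync(
    stored_hashes={"api": "h1", "faq": "h2", "old": "h3"},
    manifest_hashes={"api": "h1", "faq": "h2b", "new": "h4"},
)
# plan.unchanged == ["api"], plan.changed == ["faq", "new"], plan.removed == ["old"]
```

Keeping the plan separate from the side effects (embedding, deleting) makes the three branches easy to test in isolation.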
- Sort imports, drop quoted type annotation, drop unused Manifest import
- Import DocsClient under TYPE_CHECKING so the annotation in load_documentation_from_remote resolves without a runtime cycle
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Align with main's ruff format pass (commit 7a56238). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
a2de469 to 4e09621
- Add DOCS_USE_REMOTE / DOCS_BASE_URL / DOCS_REFRESH_INTERVAL / DOCS_CACHE_DIR / DOCS_HTTP_TIMEOUT to the env vars table.
- Rewrite "Knowledge base" to describe the three modes (local, remote, shadow), the manifest.version short-circuit, and the If-None-Match/304 cache layout under DOCS_CACHE_DIR.
- Replace the "no test suite" line with a pointer to pytest + pytest-httpx and CONTRIBUTING.md's testing policy.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
- New typed async client (bot/docs_client.py) over httpx that consumes the public docs API published by helmcode/nan: GET /api/docs/manifest.json and GET /api/docs/{slug}.md. It verifies sha256 and caches to disk (DOCS_CACHE_DIR) as a fallback if the remote goes down.
- New load_documentation_from_remote() (bot/knowledge.py) that diffs by contentHash, only re-chunks/embeds what changed, removes sources that are no longer in the manifest, and falls back to cache → local in cascade.
- Background task in bot/base.py that refreshes every DOCS_REFRESH_INTERVAL (60s floor), re-embeds what is new and exposes docs_last_sync + docs_last_sync_ok on /health. The !docs command no longer lists bot/docs/knowledge/*.md; it lists the sources tracked in the vector store.
- Cold start in main.py branches on DOCS_USE_REMOTE:
  - local (default): current behavior, no remote calls.
  - remote: remote only, with fallback to local if both the remote and the cache fail.
  - shadow: keeps loading local but also fetches the remote manifest and logs a per-slug diff (local_only / remote_only / changed). Intended to validate parity before the switch.
Why
Until now the .md files under bot/docs/knowledge/ lived in this repo, which forced editing two repos for every documentation change and left the bot out of sync. With this change, the website is the single source of truth and the bot reindexes it by hash.
Notes for review
- The legacy path (bot/docs/knowledge/*.md + load_documentation) is left untouched: the current deploy keeps working with DOCS_USE_REMOTE=local. Cleaning up those .md files and load_documentation will be a follow-up once remote is validated in prod.
- DocsClient validates the slug with ^[a-z0-9][a-z0-9-]{0,63}$ before touching disk, to prevent path traversal in the cache.
- shadow also calls fetch_body for every entry to surface remote errors early (not just the manifest).
- httpx>=0.27 added as a new dependency in pyproject.toml.
Test plan
- uv sync to resolve httpx.
- Against a local instance (http://localhost:4321) with DOCS_USE_REMOTE=remote: validate the manifest, fetch_body per slug, and that the hashes match.
- shadow in staging against https://nan.builders: check the logs Shadow diff: local_only=… remote_only=… changed=… and that there are no failed fetches.
- /health returns docs_last_sync and docs_last_sync_ok=true after the first refresh.
- !docs lists the slugs from the store and not the local files.
🤖 Generated with Claude Code