feat(portal): optimize portal discovery for AI agents and crawlers#60

Merged
BatLeDev merged 8 commits into master from optim-agents-indexing
May 18, 2026

@BatLeDev BatLeDev commented May 18, 2026

  • Declare AI content usage preferences in robots.txt via the Cloudflare Content-Signal directive (search=yes, ai-train=no, ai-input=yes)
  • Expose /.well-known/security.txt (RFC 9116) with a 1-year Expires and two Contact lines: https://github.com/data-fair (always) and the portal's /contact page when that page is actually configured (same mongo existence check as the sitemap route); contactInformations.email is intentionally kept private
  • Expose /.well-known/change-password (W3C) — 302 to /simple-directory/login?action=changePassword when authentication is enabled, 404 when authentication: 'none' or for drafts
  • Expose /.well-known/api-catalog (RFC 9727) as application/linkset+json, anchored on the data-fair root API plus one entry per published dataset; 404 when allowRobots: false or for drafts; data-fair fetch errors propagate as 500 (aligned with sitemap.xml.ts)
  • Align application-card elevation/rounded fallback on portalConfig.defaults to match news-card, reuse-card, dataset-metadata, etc.
  • Update e2e assertions to match the Allow-list format introduced previously and cover the three new .well-known endpoints

BatLeDev added 6 commits May 18, 2026 11:48
Add a Content-Signal directive (draft-romm-aipref-contentsignals) to
robots.txt to opt out of AI training while keeping classic search
indexing and live agent retrieval enabled.

- search=yes: keep visibility in Google/Bing
- ai-train=no: data changes too frequently to be safely frozen in a
  training corpus
- ai-input=yes: stay reachable for live RAG (Perplexity, ChatGPT
  browse, Claude web search)
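
As an illustration of the directive described above, here is a minimal sketch of how the three preferences could be rendered into a robots.txt line. The helper names (`contentSignalLine`, `robotsGroupHeader`) and the placement inside the wildcard user-agent group are assumptions for this example, not code from the PR; only the three signal values come from the commit message.

```typescript
// Hypothetical helpers: only the signal values are taken from the commit message.
type SignalValue = 'yes' | 'no'

interface ContentSignals {
  search: SignalValue
  aiTrain: SignalValue
  aiInput: SignalValue
}

// Render the preferences as a single robots.txt line.
function contentSignalLine (signals: ContentSignals): string {
  return 'Content-Signal: ' + [
    `search=${signals.search}`,
    `ai-train=${signals.aiTrain}`,
    `ai-input=${signals.aiInput}`
  ].join(', ')
}

// Assumed placement: directly under the wildcard user-agent group.
function robotsGroupHeader (signals: ContentSignals): string {
  return ['User-agent: *', contentSignalLine(signals)].join('\n')
}

const portalSignals: ContentSignals = { search: 'yes', aiTrain: 'no', aiInput: 'yes' }
console.log(robotsGroupHeader(portalSignals))
```

Keeping the values in a typed object rather than a hard-coded string would make a typo such as `ai-train=non` a compile-time error.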

Adds /.well-known/security.txt advertising a contact mailbox for
security disclosures. Expires is recomputed at each request (now + 1
year) so the file never falls out of RFC compliance without a deploy.

Served regardless of allowRobots / draft status — the security contact
must remain reachable on any deployed portal.
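
The "now + 1 year" computation can be sketched as a pure function. This is an assumption-laden illustration, not the route's actual code: `buildSecurityTxt` is a hypothetical name, the clock is injected so the result is deterministic, and the contact value is passed in by the caller.

```typescript
// Hypothetical builder: recomputes Expires from the supplied clock on every call.
function buildSecurityTxt (contacts: string[], now: Date = new Date()): string {
  // Copy the date so the caller's object is not mutated.
  const expires = new Date(now.getTime())
  expires.setUTCFullYear(expires.getUTCFullYear() + 1)

  const lines = contacts.map(contact => `Contact: ${contact}`)
  // toISOString() yields an RFC 3339 timestamp in UTC.
  lines.push(`Expires: ${expires.toISOString()}`)
  return lines.join('\n') + '\n'
}

console.log(buildSecurityTxt(['https://github.com/data-fair']))
```

Because the date is derived at request time, a long-running deployment never serves an expired file.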

Adds /.well-known/change-password redirecting to the simple-directory
password-change flow. Used by password managers (1Password, Bitwarden,
Apple Passwords, Chrome) to auto-navigate users when triggered from
the saved-passwords UI.

Returns 404 when authentication is disabled on the portal or when the
portal is a draft — no point advertising a password flow that has no
accounts to manage.
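
The gating described above boils down to a small decision function. The sketch below assumes a portal object with an `authentication` string and a `draft` boolean; the `draft` field name, the `PortalFlags` type and `resolveChangePassword` are hypothetical, while the redirect target is the one quoted in the PR description.

```typescript
// Hypothetical shape: only the 'none' value and the redirect URL come from the PR text.
interface PortalFlags {
  authentication: string
  draft: boolean
}

type WellKnownResult =
  | { status: 302, location: string }
  | { status: 404 }

function resolveChangePassword (portal: PortalFlags): WellKnownResult {
  // Nothing to advertise when the portal has no accounts or is not published.
  if (portal.draft || portal.authentication === 'none') return { status: 404 }
  return { status: 302, location: '/simple-directory/login?action=changePassword' }
}

console.log(resolveChangePassword({ authentication: 'required', draft: false }))
console.log(resolveChangePassword({ authentication: 'none', draft: false }))
```

In an actual route handler the returned status and location would be translated into the framework's redirect or error response.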

Adds /.well-known/api-catalog returning application/linkset+json. The
linkset starts with the global data-fair API entry (service-desc:
OpenAPI spec, service-doc: /catalog-api-doc, status: /ping, collection:
/datasets) and then enumerates one entry per dataset published on the
portal, each pointing to its own filtered OpenAPI spec and human-doc
page.

Each dataset is genuinely a distinct API surface (filtered actions,
dedicated OpenAPI spec), so enumeration honours the RFC 9727 model of
listing APIs rather than resources. Capped at 1000 entries (same limit
as sitemap.xml).

Gated by allowRobots and draft like sitemap / robots.txt — hidden
portals return 404.
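
To make the linkset shape concrete, here is a minimal sketch of a builder that emits one root entry followed by one entry per dataset, capped at 1000. The type names, `buildApiCatalog`, and every URL in the usage example are hypothetical; only the link relations (service-desc, service-doc, status, collection) and the cap come from the commit message, and the JSON layout follows the generic linkset format (`linkset` array of contexts keyed by `anchor`).

```typescript
// Hypothetical types for a linkset document (one context object per API).
interface LinkTarget { href: string }
interface LinkContext {
  anchor: string
  [relation: string]: string | LinkTarget[]
}
interface Linkset { linkset: LinkContext[] }

interface ApiDescription {
  apiUrl: string
  openApiUrl: string
  docUrl: string
}
interface RootApi extends ApiDescription {
  statusUrl: string
  collectionUrl: string
}

// Same cap as the one mentioned for sitemap.xml.
const MAX_DATASET_ENTRIES = 1000

function buildApiCatalog (root: RootApi, datasets: ApiDescription[]): Linkset {
  const rootEntry: LinkContext = {
    anchor: root.apiUrl,
    'service-desc': [{ href: root.openApiUrl }],
    'service-doc': [{ href: root.docUrl }],
    status: [{ href: root.statusUrl }],
    collection: [{ href: root.collectionUrl }]
  }
  const datasetEntries = datasets.slice(0, MAX_DATASET_ENTRIES).map((dataset): LinkContext => ({
    anchor: dataset.apiUrl,
    'service-desc': [{ href: dataset.openApiUrl }],
    'service-doc': [{ href: dataset.docUrl }]
  }))
  return { linkset: [rootEntry, ...datasetEntries] }
}

// Usage with made-up URLs on an example host.
const exampleCatalog = buildApiCatalog(
  {
    apiUrl: 'https://portal.example/api',
    openApiUrl: 'https://portal.example/api/openapi.json',
    docUrl: 'https://portal.example/catalog-api-doc',
    statusUrl: 'https://portal.example/api/ping',
    collectionUrl: 'https://portal.example/api/datasets'
  },
  [{
    apiUrl: 'https://portal.example/api/datasets/demo',
    openApiUrl: 'https://portal.example/api/datasets/demo/openapi.json',
    docUrl: 'https://portal.example/datasets/demo/api-doc'
  }]
)
console.log(JSON.stringify(exampleCatalog, null, 2))
```

The response would be served with the `application/linkset+json` content type; any failure while fetching the dataset list is left to surface as an error rather than producing a truncated catalog.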

Also adds e2e coverage in seo-indexing for the three new well-known
endpoints (security.txt, change-password, api-catalog).

The robots.txt response was refactored to publish an explicit Allow-list
of public sections followed by a fallback Disallow: / (so unknown paths
are blocked rather than reachable). The legacy assertions still expected
a single Allow: / with no Disallow, which always fails on the new
output.

Update the indexable-portal assertions to check for representative Allow
rules (Allow: /$, Allow: /datasets) and drop the Disallow exclusion. On
the hidden-portal side, additionally assert no Allow rule leaks through.
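
The updated checks can be pictured as two predicates over the robots.txt body. This is only a sketch of the idea: the helper names and the two sample bodies are assumptions, not the actual e2e test code or the portal's exact output; the rules being looked for (Allow: /$, Allow: /datasets, Disallow: /) are the ones named in the commit message.

```typescript
// Split a robots.txt body into trimmed, non-empty, non-comment lines.
function robotsRules (robotsTxt: string): string[] {
  return robotsTxt
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0 && line.charAt(0) !== '#')
}

// Lines starting with "Allow:" (the anchor keeps "Disallow:" from matching).
function allowRules (robotsTxt: string): string[] {
  return robotsRules(robotsTxt).filter(line => /^allow:/i.test(line))
}

// Indexable portal: representative Allow rules are present.
function looksIndexable (robotsTxt: string): boolean {
  const rules = robotsRules(robotsTxt)
  return rules.indexOf('Allow: /$') !== -1 && rules.indexOf('Allow: /datasets') !== -1
}

// Hidden portal: everything is disallowed and no Allow rule leaks through.
function looksHidden (robotsTxt: string): boolean {
  return allowRules(robotsTxt).length === 0 && robotsRules(robotsTxt).indexOf('Disallow: /') !== -1
}

// Sample bodies, assumed for illustration only.
const indexableSample = 'User-agent: *\nAllow: /$\nAllow: /datasets\nDisallow: /\n'
const hiddenSample = 'User-agent: *\nDisallow: /\n'
console.log(looksIndexable(indexableSample), looksHidden(hiddenSample))
```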
BatLeDev added 2 commits May 18, 2026 15:48
Align with sitemap.xml.ts which lets errors propagate to h3's default
500 handler instead of returning a partial linkset.

Surface https://github.com/data-fair as a stable Contact, plus the
portal's /contact page when it has actually been configured (same
mongo existence check as the sitemap route). The private
contactInformations.email is intentionally never exposed.
@github-actions github-actions Bot added feature and removed feature labels May 18, 2026
@BatLeDev BatLeDev merged commit 1f2e429 into master May 18, 2026
4 checks passed
@BatLeDev BatLeDev deleted the optim-agents-indexing branch May 18, 2026 15:31