[discussion] Catalog-audit use case — is it in scope for the agent's design?

Hi @shirshanka — sorry to land a design question in the issue tracker (Discussions are disabled, so this seemed the cleanest place; happy to relocate if there's a better one).

Context: I'm a data-governance practitioner exploring the Analytics Agent as part of evaluating where DataHub could help our team's catalog quality work — multi-source metadata, manual ownership, uneven description coverage. Loved the recent blog post; the "passive accumulation → active evolution" framing resonates strongly.

While playing with the agent locally against `datahub docker quickstart` sample data, I noticed something I'd like to understand the design intent behind, before going further.

The system prompt seems to anchor the workflow around what I'd call an **analyst-question shape**:
- The goal statement at the top frames the agent's job as answering "data questions" terminating in SQL execution.
- Tool definitions explicitly mark `search_documents` as "USE THIS FIRST".
- Workflow Step 1 reinforces it: *"Always call search_documents before doing anything else."*

The "Documentation = intent / catalog = existence" principle further down separates the two cleanly, but the operational workflow doesn't seem to give catalog-existence a peer-level path — it appears mostly as scaffolding toward SQL.

The behavior I observed (against quickstart sample data):

- Asked *"tell me about SampleHiveDataset"*. The dataset exists in DataHub (verified via direct GMS call) but has no description, glossary term, or domain assignment.
- The agent ran `search_documents`, got zero hits, and concluded the dataset *"doesn't exist in the catalog"* — without falling through to an entity-level search.
- `/improve-context` (per `SKILL.md` Step 1) explicitly scopes itself to the conversation history, so datasets that nobody has asked about — often the highest-governance-value ones — never enter the improvement candidate set.

My question, purely about design intent (not asking for any fix at this point):

**Is the agent intentionally scoped to "analyst asking business questions," with catalog audit / proactive governance treated as a separate concern (perhaps a different surface entirely)? Or is the analyst use case the phase-1 build-out, with the governance angle planned but not yet shaped?**

The blog framing read to me like it covers the governance use case, but the implementation seems to keep "active evolution" bounded by what users happen to query. Wanted to understand whether that's deliberate scoping (and if so, the reasoning) or a phase-1 limitation.

Happy to file smaller, behavior-specific issues separately if useful — asked the bigger question first so I don't propose anything that's already considered out of scope.

Thanks for any color. This is the most interesting AI-over-metadata project I've come across, and I'd like to understand its shape correctly before contributing in earnest.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[discussion] Catalog-audit use case — is it in scope for the agent's design? #57

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[discussion] Catalog-audit use case — is it in scope for the agent's design? #57

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions