Skip to content

[discussion] Catalog-audit use case — is it in scope for the agent's design? #57

@shawnxiao105-afk

Description

@shawnxiao105-afk

Hi @shirshanka — sorry to land a design question in the issue tracker (Discussions are disabled, so this seemed the cleanest place; happy to relocate if there's a better one).

Context: I'm a data-governance practitioner exploring the Analytics Agent as part of evaluating where DataHub could help our team's catalog quality work — multi-source metadata, manual ownership, uneven description coverage. Loved the recent blog post; the "passive accumulation → active evolution" framing resonates strongly.

While playing with the agent locally against datahub docker quickstart sample data, I noticed something I'd like to understand the design intent behind, before going further.

The system prompt seems to anchor the workflow around what I'd call an analyst-question shape:

  • The goal statement at the top frames the agent's job as answering "data questions" terminating in SQL execution.
  • Tool definitions explicitly mark search_documents as "USE THIS FIRST".
  • Workflow Step 1 reinforces it: "Always call search_documents before doing anything else."

The "Documentation = intent / catalog = existence" principle further down separates the two cleanly, but the operational workflow doesn't seem to give catalog-existence a peer-level path — it appears mostly as scaffolding toward SQL.

The behavior I observed (against quickstart sample data):

  • Asked "tell me about SampleHiveDataset". The dataset exists in DataHub (verified via direct GMS call) but has no description, glossary term, or domain assignment.
  • The agent ran search_documents, got zero hits, and concluded the dataset "doesn't exist in the catalog" — without falling through to an entity-level search.
  • /improve-context (per SKILL.md Step 1) explicitly scopes itself to the conversation history, so datasets that nobody has asked about — often the highest-governance-value ones — never enter the improvement candidate set.

My question, purely about design intent (not asking for any fix at this point):

Is the agent intentionally scoped to "analyst asking business questions," with catalog audit / proactive governance treated as a separate concern (perhaps a different surface entirely)? Or is the analyst use case the phase-1 build-out, with the governance angle planned but not yet shaped?

The blog framing read to me like it covers the governance use case, but the implementation seems to keep "active evolution" bounded by what users happen to query. Wanted to understand whether that's deliberate scoping (and if so, the reasoning) or a phase-1 limitation.

Happy to file smaller, behavior-specific issues separately if useful — asked the bigger question first so I don't propose anything that's already considered out of scope.

Thanks for any color. This is the most interesting AI-over-metadata project I've come across, and I'd like to understand its shape correctly before contributing in earnest.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions