
Polaris POC #133

Open
Tianhao-Gu wants to merge 13 commits into main from feature/polaris

Conversation

@Tianhao-Gu
Contributor

No description provided.

…polaris"

This reverts commit c150825, reversing
changes made to 7ce659e.
Copilot AI review requested due to automatic review settings March 12, 2026 20:38
@github-actions

github-actions bot commented Mar 12, 2026

Coverage report

| File | Lines missing |
|------|---------------|
| notebook_utils/berdl_notebook_utils/berdl_settings.py | |
| notebook_utils/berdl_notebook_utils/refresh.py | |
| notebook_utils/berdl_notebook_utils/setup_spark_session.py | 471, 473-476, 478-481 |
| notebook_utils/berdl_notebook_utils/agent/tools.py | |
| notebook_utils/berdl_notebook_utils/mcp/operations.py | |
| notebook_utils/berdl_notebook_utils/minio_governance/__init__.py | |
| notebook_utils/berdl_notebook_utils/minio_governance/operations.py | 35, 256-257, 1099-1103 |
| notebook_utils/berdl_notebook_utils/spark/connect_server.py | 393 |
| notebook_utils/berdl_notebook_utils/spark/database.py | 49-50 |
| Project Total | |

This report was generated by python-coverage-comment-action

@codecov

codecov bot commented Mar 12, 2026

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a Polaris/Iceberg proof-of-concept across the notebook utilities stack: provisioning Polaris credentials, configuring Spark Connect with Iceberg REST catalogs, updating MCP operations to align with catalog-based browsing, and expanding docs + local docker-compose support for running Polaris.

Changes:

  • Add Polaris credential provisioning + Iceberg REST catalog Spark configuration (server-side via Spark Connect) alongside existing Delta/Hive support.
  • Introduce admin/user utilities for migration and refresh flows (Delta→Iceberg migration script, environment refresh helper, governance ops additions).
  • Update local dev orchestration and documentation for the new catalog model (docker-compose adds Polaris services; docs updated; dependency bumps).

Reviewed changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 8 comments.

| File | Description |
|------|-------------|
| scripts/migrate_delta_to_iceberg.py | Adds notebook-oriented utilities to migrate Delta/Hive tables to Iceberg in Polaris catalogs. |
| scripts/init-polaris-db.sh | Creates a polaris DB in the shared Postgres instance for Polaris persistence. |
| notebook_utils/uv.lock | Bumps locked dependencies (incl. MCP + MinIO manager clients and various PyPI packages). |
| notebook_utils/tests/test_refresh.py | Adds tests for the new credential + Spark environment refresh helper. |
| notebook_utils/tests/test_get_spark_session.py | Adds tests for Polaris catalog config and Spark Connect immutable-config filtering. |
| notebook_utils/tests/test_cache.py | Adds tests for token-dependent cache management utilities. |
| notebook_utils/tests/spark/test_database.py | Updates and extends namespace-creation tests for both Hive/Delta and Iceberg/Polaris flows. |
| notebook_utils/tests/spark/test_data_store.py | Reorders imports and adds internal helpers for data_store tests. |
| notebook_utils/tests/spark/test_connect_server.py | Expands Spark Connect server tests (catalog config inclusion, port release, kill handling, start flows). |
| notebook_utils/tests/minio_governance/test_operations.py | Adds extensive coverage for Polaris credential caching/provisioning and new governance ops. |
| notebook_utils/tests/mcp/test_operations.py | Updates MCP tests after removing the HMS-mode parameter from database listing. |
| notebook_utils/tests/agent/test_prompts.py | Adds tests for system prompt generation and tool descriptions. |
| notebook_utils/tests/agent/test_mcp_tools.py | Expands tests for MCP tool wrapping, event-loop integration, and cleanup behavior. |
| notebook_utils/pyproject.toml | Updates git-sourced client versions for MCP + MinIO manager. |
| notebook_utils/berdl_notebook_utils/spark/database.py | Adds an iceberg=True path to create namespaces in Polaris catalogs (catalog.namespace). |
| notebook_utils/berdl_notebook_utils/spark/data_store.py | Import ordering tweaks. |
| notebook_utils/berdl_notebook_utils/spark/connect_server.py | Adds server-side writing of Polaris catalog configs into spark-defaults.conf; refactors running checks. |
| notebook_utils/berdl_notebook_utils/spark/__init__.py | Updates the package docstring to reflect Iceberg-oriented data_store usage. |
| notebook_utils/berdl_notebook_utils/setup_spark_session.py | Adds _get_catalog_conf, immutable-config filtering improvements, and Spark Connect catalog warm-up. |
| notebook_utils/berdl_notebook_utils/refresh.py | Adds a single entrypoint to clear caches, re-provision credentials, restart Spark Connect, and stop active Spark sessions. |
| notebook_utils/berdl_notebook_utils/minio_governance/operations.py | Adds a file-locked caching helper, Polaris provisioning + caching, list_user_names, and admin migration endpoints; deprecates path-sharing helpers. |
| notebook_utils/berdl_notebook_utils/minio_governance/__init__.py | Exposes new Polaris + migration functions at the package level. |
| notebook_utils/berdl_notebook_utils/mcp/operations.py | Updates MCP list operations to reflect Iceberg catalog namespace semantics and removes the use_hms parameter. |
| notebook_utils/berdl_notebook_utils/berdl_settings.py | Adds Polaris-related settings and makes the Hive metastore URI optional. |
| notebook_utils/berdl_notebook_utils/agent/tools.py | Adjusts agent tools to the updated MCP operations signatures. |
| notebook_utils/berdl_notebook_utils/__init__.py | Exports refresh_spark_environment. |
| docs/user_guide.md | Updates the user guide to describe Iceberg catalog access and adds a migration note. |
| docs/tenant_sql_warehouse_guide.md | Updates the tenant warehouse guide to catalog-based (my/tenant) semantics. |
| docs/iceberg_migration_guide.md | Adds a comprehensive Polaris/Iceberg migration guide. |
| docs/data_sharing_guide.md | Updates sharing guidance for Iceberg-era workflows and deprecations. |
| docker-compose.yaml | Adds Polaris services + bootstrap and wires Polaris env vars into local services. |
| configs/spark-defaults.conf.template | Adds Iceberg Spark SQL extensions and updates interceptor configuration. |
| configs/jupyter_server_config.py | Provisions MinIO + Polaris at server startup and starts Spark Connect in a background thread. |
| configs/ipython_startup/01-minio-credentials.py | Removes the MinIO-only startup script (replaced by a combined credentials script). |
| configs/ipython_startup/01-credentials.py | Adds combined MinIO + Polaris credential provisioning at kernel startup. |
| configs/ipython_startup/00-notebookutils.py | Adds the Polaris credential function + refresh helper to the auto-import set. |
| Dockerfile | Updates the default base image tag. |


Merged latest main (PR #135 test coverage improvements) into feature/polaris.
Resolved 5 conflicts by keeping both sides' test additions where applicable,
preferring main's safer patterns (try/finally cleanup, concurrent.futures.TimeoutError,
time.sleep patching).
Copilot AI review requested due to automatic review settings March 13, 2026 20:09
Resolved conflicts in jupyter_server_config.py (keep Polaris provisioning)
and test_connect_server.py (keep both catalog and namespace prefix tests).
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 31 out of 32 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

docker-compose.yaml:132

  • SERVICE_TEMPLATE is set to spark-notebook, but this compose file now declares the notebook service as spark-notebook-${CI_KBASE_USERNAME} (e.g. spark-notebook-tgu2). If the proxy derives the backend hostname from SERVICE_TEMPLATE, this will route to a non-existent service. Align SERVICE_TEMPLATE with the service naming convention (e.g., include the username placeholder or revert the notebook service name).
```yaml
      - SERVICE_TEMPLATE=spark-notebook
```


…resh updates

- Keep Polaris services + add MMS credential store (PostgreSQL) config
- Use rotate_minio_credentials in refresh.py instead of get_minio_credentials
- Update minio-manager-service-client to v0.0.13, datalake-mcp-server-client to v0.0.8
- Add init-mms-db.sh as 03-init-mms-db.sh alongside polaris-db init
- Merge test coverage for both Polaris and rotate credential paths
…icies, error handling, relative imports

Resolves merge conflicts across operations.py, __init__.py, data_store.py, and test_operations.py.
Makes polaris-specific client imports conditional since governance client v0.0.14 doesn't include polaris endpoints.
Copilot AI review requested due to automatic review settings March 16, 2026 02:17
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 12 comments.



Resolve conflicts:
- docker-compose.yaml: keep both Polaris and Trino env vars
- refresh.py: use _get_credentials_cache_path() + _get_polaris_cache_path()
- test_operations.py: merge coverage tests from main with Polaris tests,
  fix list_user_names_sync patch target
Copilot AI review requested due to automatic review settings March 23, 2026 01:51
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.



- Remove duplicate list_user_names import
- Remove duplicate UserNamesResponse import (already from governance_client.models)
- Remove duplicate TestListUserNames class (kept original with more thorough assertions)
Copilot AI review requested due to automatic review settings March 23, 2026 02:04
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.



Resolve conflicts in operations.py, refresh.py, and their tests by
keeping the feature/polaris versions (file-based credential caching
and Polaris credential refresh).
Align MinIO credential flow with main (PR #142): use direct API calls
instead of local file cache with file locking. The governance API
caches credentials server-side in PostgreSQL.

Polaris credentials still use file-based caching since the provisioning
API is heavier weight.
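The file-locked caching pattern described for the Polaris credentials could look roughly like this. This is a sketch under assumptions: the function name, the `fetch` callback, and the `.lock` sidecar-file convention are illustrative, not the PR's actual code; the real helper lives in `minio_governance/operations.py`.

```python
import fcntl
import json
import os

def read_cached_or_fetch(cache_path: str, fetch) -> dict:
    """Read-through cache: hold an exclusive flock while checking or
    refreshing the cache file, so concurrent kernels on the same host
    don't all hit the (heavier) provisioning API at once."""
    lock_path = cache_path + ".lock"
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # released when lock_file closes
        if os.path.exists(cache_path):
            try:
                with open(cache_path) as f:
                    return json.load(f)
            except (OSError, json.JSONDecodeError):
                pass  # corrupt or unreadable cache: fall through and re-fetch
        credentials = fetch()
        with open(cache_path, "w") as f:
            json.dump(credentials, f)
        return credentials
```

The lock is advisory and Unix-only (`fcntl`), which matches the notebook-container environment this PR targets; a second call with a warm cache returns without invoking `fetch` again.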
Copilot AI review requested due to automatic review settings March 23, 2026 19:07
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +166 to +170:

```python
    """Write Polaris credentials to cache file."""
    try:
        with open(cache_path, "w") as f:
            json.dump(credentials, f)
    except (OSError, TypeError):
```

Copilot AI Mar 23, 2026


Polaris credentials (including client_secret) are persisted to a plaintext JSON file in the user’s home directory. Please at least write this file with restrictive permissions (e.g., chmod 0600) and use an atomic write (temp file + replace) to avoid partial/corrupt reads. Also consider whether caching the secret on disk is necessary vs caching only non-sensitive fields or using a short TTL.

Comment on lines +271 to +276:

```python
    data = api_response.to_dict()
    return {
        "client_id": data.get("client_id", ""),
        "client_secret": data.get("client_secret", ""),
        "personal_catalog": data.get("personal_catalog", ""),
        "tenant_catalogs": data.get("tenant_catalogs", []),
```

Copilot AI Mar 23, 2026


_fetch_polaris_credentials() uses empty-string defaults for required fields. If the API response is missing client_id/client_secret/personal_catalog, this can lead to caching invalid credentials and setting POLARIS_CREDENTIAL to ":". Consider validating required fields are non-empty (and tenant_catalogs is a list) before writing the cache / exporting env vars; otherwise return None (or raise) to surface provisioning failures.

Suggested change:

```diff
-    data = api_response.to_dict()
-    return {
-        "client_id": data.get("client_id", ""),
-        "client_secret": data.get("client_secret", ""),
-        "personal_catalog": data.get("personal_catalog", ""),
-        "tenant_catalogs": data.get("tenant_catalogs", []),
+    data = api_response.to_dict() if api_response is not None else {}
+    client_id = data.get("client_id")
+    client_secret = data.get("client_secret")
+    personal_catalog = data.get("personal_catalog")
+    tenant_catalogs = data.get("tenant_catalogs")
+    # Validate required fields are present and non-empty before caching/returning
+    if not client_id or not client_secret or not personal_catalog:
+        polaris_logger.warning(
+            "Polaris provisioning returned invalid credentials (missing client_id, client_secret, or personal_catalog)"
+        )
+        return None
+    # tenant_catalogs is optional but must be a list if present
+    if tenant_catalogs is None:
+        tenant_catalogs = []
+    elif not isinstance(tenant_catalogs, list):
+        polaris_logger.warning(
+            "Polaris provisioning returned invalid tenant_catalogs (expected list)"
+        )
+        return None
+    return {
+        "client_id": client_id,
+        "client_secret": client_secret,
+        "personal_catalog": personal_catalog,
+        "tenant_catalogs": tenant_catalogs,
```

Comment on lines +197 to +202:

```yaml
    volumes:
      # Mount the shared /home directory to access all users' credentials
      # This allows the MCP server to dynamically read any user's credentials
      # from /home/{username}/.berdl_minio_credentials
      # In K8s: mount the parent directory or use a shared volume
      - users_home:/home:ro
```

Copilot AI Mar 23, 2026


The comment says the MCP server reads /home/{username}/.berdl_minio_credentials, but this PR introduces .berdl_polaris_credentials and doesn’t create a .berdl_minio_credentials file. Please update the comment. Also, mounting the entire /home volume into the MCP server can expose cached secrets (e.g., Polaris client_secret) across users; prefer fetching per-request from the governance API or mounting only the minimal required path/file.

Comment on lines +14 to +17
| Configuration | Catalog | Example Namespace |
|--------------|---------|-------------------|
| `create_namespace_if_not_exists(spark, "analysis")` | `my` (personal) | `my.analysis` |
| `create_namespace_if_not_exists(spark, "research", tenant_name="kbase")` | `kbase` (tenant) | `kbase.research` |

Copilot AI Mar 23, 2026


These examples show create_namespace_if_not_exists(...) returning my./. without passing iceberg=True, but the implementation still defaults to the Delta/Hive flow unless iceberg=True is provided. Either update the examples to include iceberg=True or change create_namespace_if_not_exists’s default/auto-detection so the docs match behavior.

Comment on lines +68 to +72:

```python
# Personal namespace (same call)
ns = create_namespace_if_not_exists(
    spark, "analysis"
)
# Returns: "my.analysis"
```

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section says the create_namespace_if_not_exists() call is unchanged and returns my./., but the current implementation requires iceberg=True to use the Polaris/Iceberg flow. Please update these examples (add iceberg=True) or adjust the function default so the guide matches actual behavior.

Comment on lines +71 to +75:

```yaml
  # Service names use the pattern: spark-notebook-{CI_KBASE_USERNAME}
  # Update these keys if you change the usernames in .env
  spark-notebook-tgu2:
    # image: ghcr.io/berdatalakehouse/spark_notebook:main
    # platform: linux/amd64
```

Copilot AI Mar 23, 2026


The notebook service key was renamed to spark-notebook-tgu2, but this compose file still contains other references to the old hostname "spark-notebook" (e.g., SERVICE_TEMPLATE and SPARK_CONNECT_URL_TEMPLATE). As-is, those services will likely fail to connect/resolve the notebook container; please update the remaining references to match the new naming pattern (spark-notebook-${CI_KBASE_USERNAME}).

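One way to resolve the stale-reference issue the reviewer flags is to template every reference to the notebook hostname on the same variable as the service key. This is a sketch under assumptions: it assumes Docker Compose variable interpolation is available where these values are defined, and the `SPARK_CONNECT_URL_TEMPLATE` value shown (the `sc://` scheme and port 15002 are Spark Connect defaults) is illustrative rather than taken from this compose file.

```yaml
# Hypothetical excerpt: keep the proxy's backend template in sync with the
# per-user notebook service key spark-notebook-${CI_KBASE_USERNAME}.
services:
  proxy:
    environment:
      - SERVICE_TEMPLATE=spark-notebook-${CI_KBASE_USERNAME}
      - SPARK_CONNECT_URL_TEMPLATE=sc://spark-notebook-${CI_KBASE_USERNAME}:15002
```

Templating both values from one variable means a username change in `.env` propagates everywhere at once, instead of leaving dangling references to the old `spark-notebook` hostname.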