Conversation
Pull request overview
This PR introduces a Polaris/Iceberg proof-of-concept across the notebook utilities stack: provisioning Polaris credentials, configuring Spark Connect with Iceberg REST catalogs, updating MCP operations to align with catalog-based browsing, and expanding docs + local docker-compose support for running Polaris.
Changes:
- Add Polaris credential provisioning + Iceberg REST catalog Spark configuration (server-side via Spark Connect) alongside existing Delta/Hive support.
- Introduce admin/user utilities for migration and refresh flows (Delta→Iceberg migration script, environment refresh helper, governance ops additions).
- Update local dev orchestration and documentation for the new catalog model (docker-compose adds Polaris services; docs updated; dependency bumps).
Reviewed changes
Copilot reviewed 36 out of 37 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `scripts/migrate_delta_to_iceberg.py` | Adds notebook-oriented utilities to migrate Delta/Hive tables to Iceberg in Polaris catalogs. |
| `scripts/init-polaris-db.sh` | Creates a polaris DB in the shared Postgres instance for Polaris persistence. |
| `notebook_utils/uv.lock` | Bumps locked dependencies (incl. MCP + MinIO manager clients and various PyPI packages). |
| `notebook_utils/tests/test_refresh.py` | Adds tests for the new credential + Spark environment refresh helper. |
| `notebook_utils/tests/test_get_spark_session.py` | Adds tests for Polaris catalog config and Spark Connect immutable-config filtering. |
| `notebook_utils/tests/test_cache.py` | Adds tests for token-dependent cache management utilities. |
| `notebook_utils/tests/spark/test_database.py` | Updates/extends namespace-creation tests for both Hive/Delta and Iceberg/Polaris flows. |
| `notebook_utils/tests/spark/test_data_store.py` | Reorders/imports additional internal helpers for data_store tests. |
| `notebook_utils/tests/spark/test_connect_server.py` | Expands Spark Connect server tests (catalog config inclusion, port release, kill handling, start flows). |
| `notebook_utils/tests/minio_governance/test_operations.py` | Adds extensive coverage for Polaris credential caching/provisioning + new governance ops. |
| `notebook_utils/tests/mcp/test_operations.py` | Updates MCP tests after removing the HMS-mode parameter from database listing. |
| `notebook_utils/tests/agent/test_prompts.py` | Adds tests for system prompt generation and tool descriptions. |
| `notebook_utils/tests/agent/test_mcp_tools.py` | Expands tests for MCP tool wrapping/event-loop integration and cleanup behavior. |
| `notebook_utils/pyproject.toml` | Updates git-sourced client versions for MCP + MinIO manager. |
| `notebook_utils/berdl_notebook_utils/spark/database.py` | Adds an `iceberg=True` path to create namespaces in Polaris catalogs (`catalog.namespace`). |
| `notebook_utils/berdl_notebook_utils/spark/data_store.py` | Import ordering tweaks. |
| `notebook_utils/berdl_notebook_utils/spark/connect_server.py` | Adds server-side writing of Polaris catalog configs into spark-defaults.conf; refactors running checks. |
| `notebook_utils/berdl_notebook_utils/spark/__init__.py` | Updates the package docstring to reflect Iceberg-oriented data_store usage. |
| `notebook_utils/berdl_notebook_utils/setup_spark_session.py` | Adds `_get_catalog_conf`, immutable-config filtering improvements, and Spark Connect catalog warm-up. |
| `notebook_utils/berdl_notebook_utils/refresh.py` | Adds a single entrypoint to clear caches, re-provision creds, restart Spark Connect, and stop active Spark sessions. |
| `notebook_utils/berdl_notebook_utils/minio_governance/operations.py` | Adds a file-locked caching helper, Polaris provisioning + caching, `list_user_names`, and admin migration endpoints; deprecates path-sharing helpers. |
| `notebook_utils/berdl_notebook_utils/minio_governance/__init__.py` | Exposes the new Polaris + migration functions at the package level. |
| `notebook_utils/berdl_notebook_utils/mcp/operations.py` | Updates MCP list operations to reflect Iceberg catalog namespace semantics and removes the `use_hms` parameter. |
| `notebook_utils/berdl_notebook_utils/berdl_settings.py` | Adds Polaris-related settings and makes the Hive metastore URI optional. |
| `notebook_utils/berdl_notebook_utils/agent/tools.py` | Adjusts agent tools to the updated MCP operations signatures. |
| `notebook_utils/berdl_notebook_utils/__init__.py` | Exports `refresh_spark_environment`. |
| `docs/user_guide.md` | Updates the user guide to describe Iceberg catalog access and adds a migration note. |
| `docs/tenant_sql_warehouse_guide.md` | Updates the tenant warehouse guide to catalog-based (my/tenant) semantics. |
| `docs/iceberg_migration_guide.md` | Adds a comprehensive Polaris/Iceberg migration guide. |
| `docs/data_sharing_guide.md` | Updates sharing guidance for Iceberg-era workflows and deprecations. |
| `docker-compose.yaml` | Adds Polaris services + bootstrap and wires Polaris env vars into local services. |
| `configs/spark-defaults.conf.template` | Adds Iceberg Spark SQL extensions and updates interceptor configuration. |
| `configs/jupyter_server_config.py` | Provisions MinIO + Polaris at server startup and starts Spark Connect in a background thread. |
| `configs/ipython_startup/01-minio-credentials.py` | Removes the MinIO-only startup script (replaced by a combined credentials script). |
| `configs/ipython_startup/01-credentials.py` | Adds combined MinIO + Polaris credential provisioning at kernel startup. |
| `configs/ipython_startup/00-notebookutils.py` | Adds the Polaris credential function + refresh helper to the auto-import set. |
| `Dockerfile` | Updates the default base image tag. |
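The refresh entrypoint summarized for `notebook_utils/berdl_notebook_utils/refresh.py` chains four steps: clear caches, re-provision credentials, restart Spark Connect, and stop active Spark sessions. A minimal sketch of that orchestration, using stand-in helpers that only record their invocation order (not the real function bodies), might look like:

```python
# Sketch of a refresh_spark_environment-style entrypoint. The helper names
# below are stand-ins that record their invocation order; the real helpers
# in berdl_notebook_utils do actual cache/credential/Spark work.
calls = []

def clear_caches():
    calls.append("clear_caches")            # e.g. delete cached credential files

def provision_credentials():
    calls.append("provision_credentials")   # e.g. fetch fresh MinIO + Polaris creds

def restart_spark_connect():
    calls.append("restart_spark_connect")   # e.g. restart with the new catalog config

def stop_active_sessions():
    calls.append("stop_active_sessions")    # e.g. stop sessions bound to stale creds

def refresh_spark_environment():
    """Run the refresh steps in the order listed in the PR summary."""
    clear_caches()
    provision_credentials()
    restart_spark_connect()
    stop_active_sessions()

refresh_spark_environment()
print(calls)
```

The fixed ordering matters: credentials must be re-provisioned before the Spark Connect restart so the restarted server picks up the new catalog configuration.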
Merged latest main (PR #135 test coverage improvements) into feature/polaris. Resolved 5 conflicts by keeping both sides' test additions where applicable, preferring main's safer patterns (try/finally cleanup, concurrent.futures.TimeoutError, time.sleep patching).
Resolved conflicts in jupyter_server_config.py (keep Polaris provisioning) and test_connect_server.py (keep both catalog and namespace prefix tests).
Pull request overview
Copilot reviewed 31 out of 32 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
docker-compose.yaml:132
`SERVICE_TEMPLATE` is set to `spark-notebook`, but this compose file now declares the notebook service as `spark-notebook-${CI_KBASE_USERNAME}` (e.g. `spark-notebook-tgu2`). If the proxy derives the backend hostname from `SERVICE_TEMPLATE`, this will route to a non-existent service. Align `SERVICE_TEMPLATE` with the service naming convention (e.g., include the username placeholder or revert the notebook service name).

```yaml
- SERVICE_TEMPLATE=spark-notebook
```
…resh updates
- Keep Polaris services + add MMS credential store (PostgreSQL) config
- Use rotate_minio_credentials in refresh.py instead of get_minio_credentials
- Update minio-manager-service-client to v0.0.13, datalake-mcp-server-client to v0.0.8
- Add init-mms-db.sh as 03-init-mms-db.sh alongside the polaris-db init
- Merge test coverage for both the Polaris and rotate-credential paths
…icies, error handling, relative imports
Resolves merge conflicts across operations.py, __init__.py, data_store.py, and test_operations.py. Makes Polaris-specific client imports conditional, since governance client v0.0.14 doesn't include the Polaris endpoints.
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 12 comments.
Resolve conflicts:
- docker-compose.yaml: keep both the Polaris and Trino env vars
- refresh.py: use _get_credentials_cache_path() + _get_polaris_cache_path()
- test_operations.py: merge coverage tests from main with the Polaris tests; fix the list_user_names_sync patch target
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.
- Remove duplicate list_user_names import
- Remove duplicate UserNamesResponse import (already imported from governance_client.models)
- Remove duplicate TestListUserNames class (kept the original, which has more thorough assertions)
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.
Resolve conflicts in operations.py, refresh.py, and their tests by keeping the feature/polaris versions (file-based credential caching and Polaris credential refresh).
Align the MinIO credential flow with main (PR #142): use direct API calls instead of a local file cache with file locking. The governance API caches credentials server-side in PostgreSQL. Polaris credentials still use file-based caching, since the provisioning API is heavier-weight.
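Since Polaris credentials keep the file-based cache, a short-TTL read helper along the lines discussed in this thread could look like the following sketch. The function name and TTL value are assumptions for illustration, not the actual berdl_notebook_utils implementation:

```python
import json
import time
from pathlib import Path

# Hypothetical TTL; the real cache policy is not specified in this thread.
CACHE_TTL_SECONDS = 3600.0

def read_cached_credentials(cache_path: Path, ttl: float = CACHE_TTL_SECONDS):
    """Return cached credentials, or None if missing, stale, or unreadable.

    Returning None signals the caller to fall back to the (heavier-weight)
    provisioning API rather than trusting a stale or corrupt cache file.
    """
    try:
        age = time.time() - cache_path.stat().st_mtime
        if age > ttl:
            return None  # stale: force re-provisioning
        return json.loads(cache_path.read_text())
    except (OSError, ValueError):
        return None  # missing or corrupt cache falls back to the API
```

A TTL bounds how long a leaked or revoked credential stays usable from disk, which partially addresses the plaintext-caching concern raised below.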
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.
| """Write Polaris credentials to cache file.""" | ||
| try: | ||
| with open(cache_path, "w") as f: | ||
| json.dump(credentials, f) | ||
| except (OSError, TypeError): |
Polaris credentials (including `client_secret`) are persisted to a plaintext JSON file in the user's home directory. Please at least write this file with restrictive permissions (e.g., `chmod 0600`) and use an atomic write (temp file + replace) to avoid partial/corrupt reads. Also consider whether caching the secret on disk is necessary vs caching only non-sensitive fields or using a short TTL.
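One way to implement this suggestion, combining 0600 permissions with an atomic temp-file-and-replace write; the helper name is illustrative, not the function under review:

```python
import json
import os
import tempfile
from pathlib import Path

def write_credentials_atomically(cache_path: Path, credentials: dict) -> None:
    """Write credentials as JSON with owner-only permissions, atomically."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so os.replace stays atomic
    # (rename across filesystems is not atomic).
    fd, tmp_name = tempfile.mkstemp(dir=cache_path.parent, suffix=".tmp")
    try:
        os.fchmod(fd, 0o600)  # owner read/write only, set before data lands
        with os.fdopen(fd, "w") as f:
            json.dump(credentials, f)
        os.replace(tmp_name, cache_path)  # atomic on POSIX
    except BaseException:
        os.unlink(tmp_name)  # never leave a partial temp file behind
        raise
```

`tempfile.mkstemp` already creates the file with 0600 on POSIX; the explicit `fchmod` just makes the intent visible. Readers either see the old complete file or the new complete file, never a partial write.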
```python
    data = api_response.to_dict()
    return {
        "client_id": data.get("client_id", ""),
        "client_secret": data.get("client_secret", ""),
        "personal_catalog": data.get("personal_catalog", ""),
        "tenant_catalogs": data.get("tenant_catalogs", []),
```
`_fetch_polaris_credentials()` uses empty-string defaults for required fields. If the API response is missing `client_id`/`client_secret`/`personal_catalog`, this can lead to caching invalid credentials and setting `POLARIS_CREDENTIAL` to `":"`. Consider validating that the required fields are non-empty (and that `tenant_catalogs` is a list) before writing the cache / exporting env vars; otherwise return `None` (or raise) to surface provisioning failures.
Suggested change:

```python
    data = api_response.to_dict() if api_response is not None else {}
    client_id = data.get("client_id")
    client_secret = data.get("client_secret")
    personal_catalog = data.get("personal_catalog")
    tenant_catalogs = data.get("tenant_catalogs")
    # Validate required fields are present and non-empty before caching/returning
    if not client_id or not client_secret or not personal_catalog:
        polaris_logger.warning(
            "Polaris provisioning returned invalid credentials (missing client_id, client_secret, or personal_catalog)"
        )
        return None
    # tenant_catalogs is optional but must be a list if present
    if tenant_catalogs is None:
        tenant_catalogs = []
    elif not isinstance(tenant_catalogs, list):
        polaris_logger.warning(
            "Polaris provisioning returned invalid tenant_catalogs (expected list)"
        )
        return None
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "personal_catalog": personal_catalog,
        "tenant_catalogs": tenant_catalogs,
```
```yaml
volumes:
  # Mount the shared /home directory to access all users' credentials
  # This allows the MCP server to dynamically read any user's credentials
  # from /home/{username}/.berdl_minio_credentials
  # In K8s: mount the parent directory or use a shared volume
  - users_home:/home:ro
```
The comment says the MCP server reads `/home/{username}/.berdl_minio_credentials`, but this PR introduces `.berdl_polaris_credentials` and doesn't create a `.berdl_minio_credentials` file. Please update the comment. Also, mounting the entire `/home` volume into the MCP server can expose cached secrets (e.g., the Polaris `client_secret`) across users; prefer fetching per-request from the governance API or mounting only the minimal required path/file.
| Configuration | Catalog | Example Namespace |
|--------------|---------|-------------------|
| `create_namespace_if_not_exists(spark, "analysis")` | `my` (personal) | `my.analysis` |
| `create_namespace_if_not_exists(spark, "research", tenant_name="kbase")` | `kbase` (tenant) | `kbase.research` |
These examples show `create_namespace_if_not_exists(...)` returning catalog-qualified namespaces (e.g. `my.analysis`) without passing `iceberg=True`, but the implementation still defaults to the Delta/Hive flow unless `iceberg=True` is provided. Either update the examples to include `iceberg=True`, or change `create_namespace_if_not_exists`'s default/auto-detection so the docs match behavior.
```python
# Personal namespace (same call)
ns = create_namespace_if_not_exists(
    spark, "analysis"
)
# Returns: "my.analysis"
```
This section says the `create_namespace_if_not_exists()` call is unchanged and returns a catalog-qualified namespace (e.g. `my.analysis`), but the current implementation requires `iceberg=True` to use the Polaris/Iceberg flow. Please update these examples (add `iceberg=True`) or adjust the function default so the guide matches actual behavior.
```yaml
# Service names use the pattern: spark-notebook-{CI_KBASE_USERNAME}
# Update these keys if you change the usernames in .env
spark-notebook-tgu2:
  # image: ghcr.io/berdatalakehouse/spark_notebook:main
  # platform: linux/amd64
```
The notebook service key was renamed to `spark-notebook-tgu2`, but this compose file still contains other references to the old hostname `spark-notebook` (e.g., `SERVICE_TEMPLATE` and `SPARK_CONNECT_URL_TEMPLATE`). As-is, those services will likely fail to connect to or resolve the notebook container; please update the remaining references to match the new naming pattern (`spark-notebook-${CI_KBASE_USERNAME}`).
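A quick way to catch this class of drift is to scan the compose file for hostname-like references that match no declared service key. The snippet below runs against a stand-in compose fragment (not the real docker-compose.yaml from this PR) and uses only the standard library:

```python
import re

# Stand-in compose content illustrating the mismatch described above.
compose_text = """
services:
  spark-notebook-tgu2:
    image: example/spark-notebook:dev
  proxy:
    environment:
      - SERVICE_TEMPLATE=spark-notebook
      - SPARK_CONNECT_URL_TEMPLATE=sc://spark-notebook:15002
"""

# Service keys: two-space-indented identifiers directly under `services:`
declared = set(re.findall(r"(?m)^  ([A-Za-z0-9_-]+):", compose_text))

# Any spark-notebook-like token appearing anywhere in the file
refs = set(re.findall(r"spark-notebook[A-Za-z0-9_-]*", compose_text))

# References that resolve to no declared service are candidates for stale names
stale = {r for r in refs if r not in declared}
print(sorted(stale))  # ['spark-notebook']
```

Note the image name also contains `spark-notebook`, so a real check would restrict the scan to hostname positions (env var values, URLs); this sketch flags any unmatched token.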
…o feature/polaris
No description provided.