Conversation
Pull request overview
This PR introduces a Polaris/Iceberg proof-of-concept across the notebook utilities stack: provisioning Polaris credentials, configuring Spark Connect with Iceberg REST catalogs, updating MCP operations to align with catalog-based browsing, and expanding docs + local docker-compose support for running Polaris.
Changes:
- Add Polaris credential provisioning + Iceberg REST catalog Spark configuration (server-side via Spark Connect) alongside existing Delta/Hive support.
- Introduce admin/user utilities for migration and refresh flows (Delta→Iceberg migration script, environment refresh helper, governance ops additions).
- Update local dev orchestration and documentation for the new catalog model (docker-compose adds Polaris services; docs updated; dependency bumps).
Reviewed changes
Copilot reviewed 36 out of 37 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `scripts/migrate_delta_to_iceberg.py` | Adds notebook-oriented utilities to migrate Delta/Hive tables to Iceberg in Polaris catalogs. |
| `scripts/init-polaris-db.sh` | Creates a polaris DB in the shared Postgres instance for Polaris persistence. |
| `notebook_utils/uv.lock` | Bumps locked dependencies (incl. MCP + MinIO manager clients and various PyPI packages). |
| `notebook_utils/tests/test_refresh.py` | Adds tests for the new credential + Spark environment refresh helper. |
| `notebook_utils/tests/test_get_spark_session.py` | Adds tests for Polaris catalog config and Spark Connect immutable-config filtering. |
| `notebook_utils/tests/test_cache.py` | Adds tests for token-dependent cache management utilities. |
| `notebook_utils/tests/spark/test_database.py` | Updates/extends namespace-creation tests for both Hive/Delta and Iceberg/Polaris flows. |
| `notebook_utils/tests/spark/test_data_store.py` | Reorders/imports additional internal helpers for data_store tests. |
| `notebook_utils/tests/spark/test_connect_server.py` | Expands Spark Connect server tests (catalog config inclusion, port release, kill handling, start flows). |
| `notebook_utils/tests/minio_governance/test_operations.py` | Adds extensive coverage for Polaris credential caching/provisioning + new governance ops. |
| `notebook_utils/tests/mcp/test_operations.py` | Updates MCP tests after removing the HMS-mode parameter from database listing. |
| `notebook_utils/tests/agent/test_prompts.py` | Adds tests for system prompt generation and tool descriptions. |
| `notebook_utils/tests/agent/test_mcp_tools.py` | Expands tests for MCP tool wrapping/event-loop integration and cleanup behavior. |
| `notebook_utils/pyproject.toml` | Updates git-sourced client versions for MCP + MinIO manager. |
| `notebook_utils/berdl_notebook_utils/spark/database.py` | Adds an `iceberg=True` path to create namespaces in Polaris catalogs (`catalog.namespace`). |
| `notebook_utils/berdl_notebook_utils/spark/data_store.py` | Import ordering tweaks. |
| `notebook_utils/berdl_notebook_utils/spark/connect_server.py` | Adds server-side writing of Polaris catalog configs into spark-defaults.conf; refactors running checks. |
| `notebook_utils/berdl_notebook_utils/spark/__init__.py` | Updates the package docstring to reflect Iceberg-oriented data_store usage. |
| `notebook_utils/berdl_notebook_utils/setup_spark_session.py` | Adds `_get_catalog_conf`, immutable-config filtering improvements, and Spark Connect catalog warm-up. |
| `notebook_utils/berdl_notebook_utils/refresh.py` | Adds a single entrypoint to clear caches, re-provision creds, restart Spark Connect, and stop active Spark sessions. |
| `notebook_utils/berdl_notebook_utils/minio_governance/operations.py` | Adds a file-locked caching helper, Polaris provisioning + caching, `list_user_names`, and admin migration endpoints; deprecates path-sharing helpers. |
| `notebook_utils/berdl_notebook_utils/minio_governance/__init__.py` | Exposes the new Polaris + migration functions at the package level. |
| `notebook_utils/berdl_notebook_utils/mcp/operations.py` | Updates MCP list operations to reflect Iceberg catalog namespace semantics and removes the `use_hms` parameter. |
| `notebook_utils/berdl_notebook_utils/berdl_settings.py` | Adds Polaris-related settings and makes the Hive metastore URI optional. |
| `notebook_utils/berdl_notebook_utils/agent/tools.py` | Adjusts agent tools to the updated MCP operations signatures. |
| `notebook_utils/berdl_notebook_utils/__init__.py` | Exports `refresh_spark_environment`. |
| `docs/user_guide.md` | Updates the user guide to describe Iceberg catalog access and adds a migration note. |
| `docs/tenant_sql_warehouse_guide.md` | Updates the tenant warehouse guide to catalog-based (my/tenant) semantics. |
| `docs/iceberg_migration_guide.md` | Adds a comprehensive Polaris/Iceberg migration guide. |
| `docs/data_sharing_guide.md` | Updates sharing guidance for Iceberg-era workflows and deprecations. |
| `docker-compose.yaml` | Adds Polaris services + bootstrap and wires Polaris env vars into local services. |
| `configs/spark-defaults.conf.template` | Adds Iceberg Spark SQL extensions and updates interceptor configuration. |
| `configs/jupyter_server_config.py` | Provisions MinIO + Polaris at server startup and starts Spark Connect in a background thread. |
| `configs/ipython_startup/01-minio-credentials.py` | Removes the MinIO-only startup script (replaced by a combined credentials script). |
| `configs/ipython_startup/01-credentials.py` | Adds combined MinIO + Polaris credential provisioning at kernel startup. |
| `configs/ipython_startup/00-notebookutils.py` | Adds the Polaris credential function + refresh helper to the auto-import set. |
| `Dockerfile` | Updates the default base image tag. |
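The refresh entrypoint summarized for `notebook_utils/berdl_notebook_utils/refresh.py` chains four steps: clear caches, re-provision credentials, restart Spark Connect, and stop active Spark sessions. A minimal sketch of that orchestration, using stand-in helpers that only record their invocation order (not the real function bodies), might look like:

```python
# Sketch of a refresh_spark_environment-style entrypoint. The helper names
# below are stand-ins that record their invocation order; the real helpers
# in berdl_notebook_utils do actual cache/credential/Spark work.
calls = []

def clear_caches():
    calls.append("clear_caches")            # e.g. delete cached credential files

def provision_credentials():
    calls.append("provision_credentials")   # e.g. fetch fresh MinIO + Polaris creds

def restart_spark_connect():
    calls.append("restart_spark_connect")   # e.g. restart with the new catalog config

def stop_active_sessions():
    calls.append("stop_active_sessions")    # e.g. stop sessions bound to stale creds

def refresh_spark_environment():
    """Run the refresh steps in the order listed in the PR summary."""
    clear_caches()
    provision_credentials()
    restart_spark_connect()
    stop_active_sessions()

refresh_spark_environment()
print(calls)
```

The fixed ordering matters: credentials must be re-provisioned before the Spark Connect restart so the restarted server picks up the new catalog configuration.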
Merged latest main (PR #135 test coverage improvements) into feature/polaris. Resolved 5 conflicts by keeping both sides' test additions where applicable, preferring main's safer patterns (try/finally cleanup, concurrent.futures.TimeoutError, time.sleep patching).
Resolved conflicts in jupyter_server_config.py (keep Polaris provisioning) and test_connect_server.py (keep both catalog and namespace prefix tests).
Pull request overview
Copilot reviewed 31 out of 32 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
docker-compose.yaml:132
`SERVICE_TEMPLATE` is set to `spark-notebook`, but this compose file now declares the notebook service as `spark-notebook-${CI_KBASE_USERNAME}` (e.g. `spark-notebook-tgu2`). If the proxy derives the backend hostname from `SERVICE_TEMPLATE`, this will route to a non-existent service. Align `SERVICE_TEMPLATE` with the service naming convention (e.g., include the username placeholder or revert the notebook service name).

```yaml
- SERVICE_TEMPLATE=spark-notebook
```
…resh updates
- Keep Polaris services + add MMS credential store (PostgreSQL) config
- Use rotate_minio_credentials in refresh.py instead of get_minio_credentials
- Update minio-manager-service-client to v0.0.13, datalake-mcp-server-client to v0.0.8
- Add init-mms-db.sh as 03-init-mms-db.sh alongside the polaris-db init
- Merge test coverage for both the Polaris and rotate-credential paths
…icies, error handling, relative imports
Resolves merge conflicts across operations.py, __init__.py, data_store.py, and test_operations.py. Makes Polaris-specific client imports conditional, since governance client v0.0.14 doesn't include the Polaris endpoints.
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 12 comments.
Resolve conflicts:
- docker-compose.yaml: keep both the Polaris and Trino env vars
- refresh.py: use _get_credentials_cache_path() + _get_polaris_cache_path()
- test_operations.py: merge coverage tests from main with the Polaris tests; fix the list_user_names_sync patch target
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.
- Remove duplicate list_user_names import
- Remove duplicate UserNamesResponse import (already imported from governance_client.models)
- Remove duplicate TestListUserNames class (kept the original, which has more thorough assertions)
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.
Resolve conflicts in operations.py, refresh.py, and their tests by keeping the feature/polaris versions (file-based credential caching and Polaris credential refresh).
Align the MinIO credential flow with main (PR #142): use direct API calls instead of a local file cache with file locking. The governance API caches credentials server-side in PostgreSQL. Polaris credentials still use file-based caching, since the provisioning API is heavier-weight.
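Since Polaris credentials keep the file-based cache, a short-TTL read helper along the lines discussed in this thread could look like the following sketch. The function name and TTL value are assumptions for illustration, not the actual berdl_notebook_utils implementation:

```python
import json
import time
from pathlib import Path

# Hypothetical TTL; the real cache policy is not specified in this thread.
CACHE_TTL_SECONDS = 3600.0

def read_cached_credentials(cache_path: Path, ttl: float = CACHE_TTL_SECONDS):
    """Return cached credentials, or None if missing, stale, or unreadable.

    Returning None signals the caller to fall back to the (heavier-weight)
    provisioning API rather than trusting a stale or corrupt cache file.
    """
    try:
        age = time.time() - cache_path.stat().st_mtime
        if age > ttl:
            return None  # stale: force re-provisioning
        return json.loads(cache_path.read_text())
    except (OSError, ValueError):
        return None  # missing or corrupt cache falls back to the API
```

A TTL bounds how long a leaked or revoked credential stays usable from disk, which partially addresses the plaintext-caching concern raised below.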
Pull request overview
Copilot reviewed 28 out of 28 changed files in this pull request and generated 6 comments.
| """Write Polaris credentials to cache file.""" | ||
| try: | ||
| with open(cache_path, "w") as f: | ||
| json.dump(credentials, f) | ||
| except (OSError, TypeError): |
Polaris credentials (including `client_secret`) are persisted to a plaintext JSON file in the user's home directory. Please at least write this file with restrictive permissions (e.g., `chmod 0600`) and use an atomic write (temp file + replace) to avoid partial/corrupt reads. Also consider whether caching the secret on disk is necessary vs caching only non-sensitive fields or using a short TTL.
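One way to implement this suggestion, combining 0600 permissions with an atomic temp-file-and-replace write; the helper name is illustrative, not the function under review:

```python
import json
import os
import tempfile
from pathlib import Path

def write_credentials_atomically(cache_path: Path, credentials: dict) -> None:
    """Write credentials as JSON with owner-only permissions, atomically."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so os.replace stays atomic
    # (rename across filesystems is not atomic).
    fd, tmp_name = tempfile.mkstemp(dir=cache_path.parent, suffix=".tmp")
    try:
        os.fchmod(fd, 0o600)  # owner read/write only, set before data lands
        with os.fdopen(fd, "w") as f:
            json.dump(credentials, f)
        os.replace(tmp_name, cache_path)  # atomic on POSIX
    except BaseException:
        os.unlink(tmp_name)  # never leave a partial temp file behind
        raise
```

`tempfile.mkstemp` already creates the file with 0600 on POSIX; the explicit `fchmod` just makes the intent visible. Readers either see the old complete file or the new complete file, never a partial write.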
```python
    data = api_response.to_dict()
    return {
        "client_id": data.get("client_id", ""),
        "client_secret": data.get("client_secret", ""),
        "personal_catalog": data.get("personal_catalog", ""),
        "tenant_catalogs": data.get("tenant_catalogs", []),
```
`_fetch_polaris_credentials()` uses empty-string defaults for required fields. If the API response is missing `client_id`/`client_secret`/`personal_catalog`, this can lead to caching invalid credentials and setting `POLARIS_CREDENTIAL` to `":"`. Consider validating that the required fields are non-empty (and that `tenant_catalogs` is a list) before writing the cache / exporting env vars; otherwise return `None` (or raise) to surface provisioning failures.
Suggested change:

```python
    data = api_response.to_dict() if api_response is not None else {}
    client_id = data.get("client_id")
    client_secret = data.get("client_secret")
    personal_catalog = data.get("personal_catalog")
    tenant_catalogs = data.get("tenant_catalogs")
    # Validate required fields are present and non-empty before caching/returning
    if not client_id or not client_secret or not personal_catalog:
        polaris_logger.warning(
            "Polaris provisioning returned invalid credentials (missing client_id, client_secret, or personal_catalog)"
        )
        return None
    # tenant_catalogs is optional but must be a list if present
    if tenant_catalogs is None:
        tenant_catalogs = []
    elif not isinstance(tenant_catalogs, list):
        polaris_logger.warning(
            "Polaris provisioning returned invalid tenant_catalogs (expected list)"
        )
        return None
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "personal_catalog": personal_catalog,
        "tenant_catalogs": tenant_catalogs,
```
```yaml
volumes:
  # Mount the shared /home directory to access all users' credentials
  # This allows the MCP server to dynamically read any user's credentials
  # from /home/{username}/.berdl_minio_credentials
  # In K8s: mount the parent directory or use a shared volume
  - users_home:/home:ro
```
The comment says the MCP server reads `/home/{username}/.berdl_minio_credentials`, but this PR introduces `.berdl_polaris_credentials` and doesn't create a `.berdl_minio_credentials` file. Please update the comment. Also, mounting the entire `/home` volume into the MCP server can expose cached secrets (e.g., the Polaris `client_secret`) across users; prefer fetching per-request from the governance API or mounting only the minimal required path/file.
| Configuration | Catalog | Example Namespace |
|--------------|---------|-------------------|
| `create_namespace_if_not_exists(spark, "analysis")` | `my` (personal) | `my.analysis` |
| `create_namespace_if_not_exists(spark, "research", tenant_name="kbase")` | `kbase` (tenant) | `kbase.research` |
These examples show `create_namespace_if_not_exists(...)` returning catalog-qualified namespaces (e.g. `my.analysis`) without passing `iceberg=True`, but the implementation still defaults to the Delta/Hive flow unless `iceberg=True` is provided. Either update the examples to include `iceberg=True`, or change `create_namespace_if_not_exists`'s default/auto-detection so the docs match behavior.
```python
# Personal namespace (same call)
ns = create_namespace_if_not_exists(
    spark, "analysis"
)
# Returns: "my.analysis"
```
This section says the `create_namespace_if_not_exists()` call is unchanged and returns a catalog-qualified namespace (e.g. `my.analysis`), but the current implementation requires `iceberg=True` to use the Polaris/Iceberg flow. Please update these examples (add `iceberg=True`) or adjust the function default so the guide matches actual behavior.
```yaml
# Service names use the pattern: spark-notebook-{CI_KBASE_USERNAME}
# Update these keys if you change the usernames in .env
spark-notebook-tgu2:
  # image: ghcr.io/berdatalakehouse/spark_notebook:main
  # platform: linux/amd64
```
The notebook service key was renamed to `spark-notebook-tgu2`, but this compose file still contains other references to the old hostname `spark-notebook` (e.g., `SERVICE_TEMPLATE` and `SPARK_CONNECT_URL_TEMPLATE`). As-is, those services will likely fail to connect to or resolve the notebook container; please update the remaining references to match the new naming pattern (`spark-notebook-${CI_KBASE_USERNAME}`).
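A quick way to catch this class of drift is to scan the compose file for hostname-like references that match no declared service key. The snippet below runs against a stand-in compose fragment (not the real docker-compose.yaml from this PR) and uses only the standard library:

```python
import re

# Stand-in compose content illustrating the mismatch described above.
compose_text = """
services:
  spark-notebook-tgu2:
    image: example/spark-notebook:dev
  proxy:
    environment:
      - SERVICE_TEMPLATE=spark-notebook
      - SPARK_CONNECT_URL_TEMPLATE=sc://spark-notebook:15002
"""

# Service keys: two-space-indented identifiers directly under `services:`
declared = set(re.findall(r"(?m)^  ([A-Za-z0-9_-]+):", compose_text))

# Any spark-notebook-like token appearing anywhere in the file
refs = set(re.findall(r"spark-notebook[A-Za-z0-9_-]*", compose_text))

# References that resolve to no declared service are candidates for stale names
stale = {r for r in refs if r not in declared}
print(sorted(stale))  # ['spark-notebook']
```

Note the image name also contains `spark-notebook`, so a real check would restrict the scan to hostname positions (env var values, URLs); this sketch flags any unmatched token.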
…o feature/polaris
No description provided.