Add Nessie catalog support in docs #4180

Open · wants to merge 1 commit into base: main

Conversation

@somratdutta (Contributor) commented Jul 28, 2025

Summary

Checklist

@somratdutta requested review from a team as code owners · July 28, 2025 20:44

vercel bot commented Jul 28, 2025

@somratdutta is attempting to deploy a commit to the ClickHouse Team on Vercel.

A member of the Team first needs to authorize it.

@somratdutta (Contributor, Author) commented:

Testing Instructions

This PR depends on a recently merged fix that is not yet available as a Docker image. Below are comprehensive testing instructions to validate the changes locally using Nessie as the REST catalog backend.

Prerequisites

Download the appropriate ClickHouse binary from the build artifacts based on your platform. For macOS on Apple Silicon, use the arm_darwin build.
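
If you are pulling the binary from a CI artifact link, a minimal sketch follows; the URL below is a placeholder, not a real artifact location, and the xattr step only applies if macOS quarantines the download:

# Placeholder URL: substitute the artifact link for your platform from the CI build
curl -L -o clickhouse "<artifact-url>"

# On macOS, clear the quarantine attribute so Gatekeeper lets the binary run
xattr -d com.apple.quarantine clickhouse || true

# Sanity check
chmod +x clickhouse
./clickhouse --version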

Environment Setup

1. Initialize ClickHouse Server

   chmod +x clickhouse
   ./clickhouse server

2. Deploy Supporting Infrastructure

   Create a docker-compose.yaml file with the following configuration (a smoke test for the stack follows the compose file):

    services:
      jupyter:
        image: quay.io/jupyter/pyspark-notebook:2024-10-14
        depends_on:
          minio:
            condition: service_healthy
        command: start-notebook.sh --NotebookApp.token=''
        volumes:
          - ./notebooks:/home/jovyan/examples/
        ports:
          - "8888:8888"
    
      nessie:
        image: ghcr.io/projectnessie/nessie:latest
        ports:
          - "19120:19120"
        environment:
          - nessie.version.store.type=IN_MEMORY
          - nessie.catalog.default-warehouse=warehouse
          - nessie.catalog.warehouses.warehouse.location=s3://my-bucket/
          - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
          - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
          - nessie.catalog.service.s3.default-options.path-style-access=true
          - nessie.catalog.service.s3.default-options.auth-type=STATIC
          - nessie.catalog.secrets.access-key.name=admin
          - nessie.catalog.secrets.access-key.secret=password
          - nessie.catalog.service.s3.default-options.region=us-east-1
          - nessie.server.authentication.enabled=false
    
      minio:
        image: quay.io/minio/minio
        ports:
          - "9002:9000"
          - "9003:9001"
        environment:
          - MINIO_ROOT_USER=admin
          - MINIO_ROOT_PASSWORD=password
          - MINIO_REGION=us-east-1
        healthcheck:
          test: ["CMD", "mc", "ready", "local"]
          interval: 5s
          timeout: 10s
          retries: 5
          start_period: 30s
        entrypoint: >
          /bin/sh -c "
          minio server /data --console-address ':9001' &
          sleep 10;
          mc alias set myminio http://localhost:9000 admin password;
          mc mb myminio/my-bucket --ignore-existing;
          tail -f /dev/null"
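
Once the compose file is in place, a quick smoke test (assuming the service names and host ports above) confirms that Nessie's Iceberg REST endpoint and MinIO are up before moving on:

# Start the stack and check that MinIO's healthcheck passes
docker compose up -d
docker compose ps

# Nessie serves its Iceberg REST catalog under /iceberg; the config
# endpoint should return JSON describing the default warehouse
curl -s http://localhost:19120/iceberg/v1/config

# MinIO's liveness probe, via the S3 API port mapped to 9002
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9002/minio/health/live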

Data Ingestion via PySpark

Create the notebook notebooks/PySpark-Nessie.ipynb, which uses PySpark with Nessie and Apache Iceberg to create the test data:

from pyspark.sql import SparkSession

# Initialize SparkSession with Nessie, Iceberg, and S3 configuration
spark = (
    SparkSession.builder.appName("Nessie-Iceberg-PySpark")
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8')
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/iceberg/main/")
    .config("spark.sql.catalog.nessie.warehouse", "s3://my-bucket/")
    .config("spark.sql.catalog.nessie.type", "rest")
    .getOrCreate()
)

# Create a namespace in Nessie
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo").show()

# Create a table in the `nessie.demo` namespace using Iceberg
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS nessie.demo.sample_table (
        id BIGINT,
        name STRING
    ) USING iceberg
    """
).show()

# Insert data into the sample_table
spark.sql(
    """
    INSERT INTO nessie.demo.sample_table VALUES
    (1, 'Alice'),
    (2, 'Bob')
    """
).show()

# Query the data from the table
spark.sql("SELECT * FROM nessie.demo.sample_table").show()

# Stop the Spark session
spark.stop()
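
You can run the notebook interactively from the Jupyter UI at http://localhost:8888, or execute it headlessly inside the container; here is a sketch using nbconvert, assuming the volume mount above places the notebook under /home/jovyan/examples/:

# Execute the notebook in place; the first run downloads the Iceberg and
# AWS SDK jars from Maven Central, so it can take a few minutes
docker compose exec jupyter \
  jupyter nbconvert --to notebook --execute --inplace \
  /home/jovyan/examples/PySpark-Nessie.ipynb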

Integration Testing

After executing the notebook, connect to ClickHouse and validate the DataLakeCatalog integration with Nessie:

./clickhouse client

Execute the following SQL commands to verify functionality:

-- Enable experimental Iceberg support
SET allow_experimental_database_iceberg = 1;

-- Configure DataLakeCatalog with Nessie REST catalog backend
CREATE DATABASE demo 
ENGINE = DataLakeCatalog('http://localhost:19120/iceberg', 'admin', 'password') 
SETTINGS 
    catalog_type = 'rest', 
    storage_endpoint = 'http://localhost:9002/my-bucket', 
    warehouse = 'warehouse';

-- Verify table discovery
SHOW TABLES FROM demo;

-- Validate data retrieval
SELECT * FROM demo.`demo.sample_table`;
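
For repeat runs, the verification queries can also be scripted non-interactively; a sketch using the client's --query flag, assuming the demo database was already created in the interactive session above:

# Single quotes keep the shell from interpreting the backticks
./clickhouse client --multiquery --query '
SHOW TABLES FROM demo;
SELECT * FROM demo.`demo.sample_table`;
'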

Expected Results

The integration should successfully:

  • Discover the Iceberg table: demo.sample_table
  • Query the table returning the test dataset:
    ┌─id─┬─name──┐
    │  2 │ Bob   │
    │  1 │ Alice │
    └────┴───────┘
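
As a final sanity check, you can list what Spark wrote to the warehouse bucket, using the myminio alias created in the MinIO entrypoint; both Parquet data files and Iceberg metadata files should be present:

# Recursively list the Iceberg table layout under the bucket
docker compose exec minio mc ls --recursive myminio/my-bucket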
    
