Add Nessie catalog support in docs #4180

Open · wants to merge 1 commit into base: main

Conversation

@somratdutta (Contributor) commented Jul 28, 2025

Summary

Checklist

@somratdutta requested review from a team as code owners · July 28, 2025 20:44

vercel bot commented Jul 28, 2025

@somratdutta is attempting to deploy a commit to the ClickHouse Team on Vercel.

A member of the Team first needs to authorize it.

@somratdutta (Contributor, Author) commented:

Testing Instructions

This PR depends on a recently merged fix that is not yet available as a Docker image. Below are comprehensive testing instructions to validate the changes locally using Nessie as the REST catalog backend.

Prerequisites

Download the appropriate ClickHouse binary from the build artifacts based on your platform. For macOS on Apple Silicon, use the arm_darwin build.
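
If you are pulling the binary from a CI artifact link, a minimal sketch follows; the URL below is a placeholder, not a real artifact location, and the xattr step only applies if macOS quarantines the download:

# Placeholder URL: substitute the artifact link for your platform from the CI build
curl -L -o clickhouse "<artifact-url>"

# On macOS, clear the quarantine attribute so Gatekeeper lets the binary run
xattr -d com.apple.quarantine clickhouse || true

# Sanity check
chmod +x clickhouse
./clickhouse --version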

Environment Setup

1. Initialize ClickHouse Server

   chmod +x clickhouse
   ./clickhouse server

2. Deploy Supporting Infrastructure

   Create a docker-compose.yaml file with the following configuration (a smoke test for the stack follows the compose file):

    services:
      jupyter:
        image: quay.io/jupyter/pyspark-notebook:2024-10-14
        depends_on:
          minio:
            condition: service_healthy
        command: start-notebook.sh --NotebookApp.token=''
        volumes:
          - ./notebooks:/home/jovyan/examples/
        ports:
          - "8888:8888"
    
      nessie:
        image: ghcr.io/projectnessie/nessie:latest
        ports:
          - "19120:19120"
        environment:
          - nessie.version.store.type=IN_MEMORY
          - nessie.catalog.default-warehouse=warehouse
          - nessie.catalog.warehouses.warehouse.location=s3://my-bucket/
          - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
          - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
          - nessie.catalog.service.s3.default-options.path-style-access=true
          - nessie.catalog.service.s3.default-options.auth-type=STATIC
          - nessie.catalog.secrets.access-key.name=admin
          - nessie.catalog.secrets.access-key.secret=password
          - nessie.catalog.service.s3.default-options.region=us-east-1
          - nessie.server.authentication.enabled=false
    
      minio:
        image: quay.io/minio/minio
        ports:
          - "9002:9000"
          - "9003:9001"
        environment:
          - MINIO_ROOT_USER=admin
          - MINIO_ROOT_PASSWORD=password
          - MINIO_REGION=us-east-1
        healthcheck:
          test: ["CMD", "mc", "ready", "local"]
          interval: 5s
          timeout: 10s
          retries: 5
          start_period: 30s
        entrypoint: >
          /bin/sh -c "
          minio server /data --console-address ':9001' &
          sleep 10;
          mc alias set myminio http://localhost:9000 admin password;
          mc mb myminio/my-bucket --ignore-existing;
          tail -f /dev/null"
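
Once the compose file is in place, a quick smoke test (assuming the service names and host ports above) confirms that Nessie's Iceberg REST endpoint and MinIO are up before moving on:

# Start the stack and check that MinIO's healthcheck passes
docker compose up -d
docker compose ps

# Nessie serves its Iceberg REST catalog under /iceberg; the config
# endpoint should return JSON describing the default warehouse
curl -s http://localhost:19120/iceberg/v1/config

# MinIO's liveness probe, via the S3 API port mapped to 9002
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9002/minio/health/live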

Data Ingestion via PySpark

Create the notebook notebooks/PySpark-Nessie.ipynb, which uses PySpark with Nessie and Apache Iceberg to create the test data:

from pyspark.sql import SparkSession

# Initialize SparkSession with Nessie, Iceberg, and S3 configuration
spark = (
    SparkSession.builder.appName("Nessie-Iceberg-PySpark")
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8')
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/iceberg/main/")
    .config("spark.sql.catalog.nessie.warehouse", "s3://my-bucket/")
    .config("spark.sql.catalog.nessie.type", "rest")
    .getOrCreate()
)

# Create a namespace in Nessie
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo").show()

# Create a table in the `nessie.demo` namespace using Iceberg
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS nessie.demo.sample_table (
        id BIGINT,
        name STRING
    ) USING iceberg
    """
).show()

# Insert data into the sample_table
spark.sql(
    """
    INSERT INTO nessie.demo.sample_table VALUES
    (1, 'Alice'),
    (2, 'Bob')
    """
).show()

# Query the data from the table
spark.sql("SELECT * FROM nessie.demo.sample_table").show()

# Stop the Spark session
spark.stop()
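
You can run the notebook interactively from the Jupyter UI at http://localhost:8888, or execute it headlessly inside the container; here is a sketch using nbconvert, assuming the volume mount above places the notebook under /home/jovyan/examples/:

# Execute the notebook in place; the first run downloads the Iceberg and
# AWS SDK jars from Maven Central, so it can take a few minutes
docker compose exec jupyter \
  jupyter nbconvert --to notebook --execute --inplace \
  /home/jovyan/examples/PySpark-Nessie.ipynb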

Integration Testing

After executing the notebook, connect to ClickHouse and validate the DataLakeCatalog integration with Nessie:

./clickhouse client

Execute the following SQL commands to verify functionality:

-- Enable experimental Iceberg support
SET allow_experimental_database_iceberg = 1;

-- Configure DataLakeCatalog with Nessie REST catalog backend
CREATE DATABASE demo 
ENGINE = DataLakeCatalog('http://localhost:19120/iceberg', 'admin', 'password') 
SETTINGS 
    catalog_type = 'rest', 
    storage_endpoint = 'http://localhost:9002/my-bucket', 
    warehouse = 'warehouse';

-- Verify table discovery
SHOW TABLES FROM demo;

-- Validate data retrieval
SELECT * FROM demo.`demo.sample_table`;
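
For repeat runs, the verification queries can also be scripted non-interactively; a sketch using the client's --query flag, assuming the demo database was already created in the interactive session above:

# Single quotes keep the shell from interpreting the backticks
./clickhouse client --multiquery --query '
SHOW TABLES FROM demo;
SELECT * FROM demo.`demo.sample_table`;
'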

Expected Results

The integration should successfully:

  • Discover the Iceberg table: demo.sample_table
  • Query the table returning the test dataset:
    ┌─id─┬─name──┐
    │  2 │ Bob   │
    │  1 │ Alice │
    └────┴───────┘
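
As a final sanity check, you can list what Spark wrote to the warehouse bucket, using the myminio alias created in the MinIO entrypoint; both Parquet data files and Iceberg metadata files should be present:

# Recursively list the Iceberg table layout under the bucket
docker compose exec minio mc ls --recursive myminio/my-bucket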
    
