Databricks C++ SDK

A C++ SDK for Databricks, providing an interface for interacting with Databricks services.

Latest Release: v0.2.4

Author: Calvin Min ([email protected])


Requirements

  • C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
  • CMake 3.14 or higher
  • ODBC Driver Manager:
    • Linux/macOS: unixODBC (brew install unixodbc or apt-get install unixodbc-dev)
    • Windows: Built-in ODBC Driver Manager
  • Simba Spark ODBC Driver: Download from Databricks

ODBC Driver Setup

After installing the requirements above, you need to configure the ODBC driver:

Linux/macOS

  1. Install unixODBC (if not already installed):

    # macOS
    brew install unixodbc
    
    # Ubuntu/Debian
    sudo apt-get install unixodbc unixodbc-dev
    
    # RedHat/CentOS
    sudo yum install unixODBC unixODBC-devel
  2. Download and install Simba Spark ODBC Driver from Databricks Downloads

  3. Verify driver installation:

    odbcinst -q -d

    You should see "Simba Spark ODBC Driver" in the output.

  4. If the driver is not found, check the ODBC configuration locations:

    odbcinst -j

    Ensure the driver is registered in the odbcinst.ini file shown.
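
    A typical odbcinst.ini entry is sketched below. The Driver path is an assumption based on common install locations (for example, /Library/simba/spark/lib/libsparkodbc_sbu.dylib on macOS or /opt/simba/spark/lib/64/libsparkodbc_sb64.so on Linux); verify the actual path against your driver installation.

    [Simba Spark ODBC Driver]
    Description=Simba Spark ODBC Driver
    Driver=/opt/simba/spark/lib/64/libsparkodbc_sb64.so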

Windows

  1. Download and run the Simba Spark ODBC Driver installer from Databricks Downloads
  2. The installer will automatically register the driver with Windows ODBC Driver Manager

Using Alternative ODBC Drivers

If you prefer to use a different ODBC driver, you can configure it:

databricks::SQLConfig sql;
sql.odbc_driver_name = "Your Driver Name Here"; // Must match driver name from odbcinst -q -d

auto client = databricks::Client::Builder()
    .with_environment_config()
    .with_sql(sql)
    .build();

Automated Setup Check

Run the setup checker script to verify your ODBC configuration:

./scripts/check_odbc_setup.sh

This will verify:

  • unixODBC installation
  • ODBC configuration files
  • Installed ODBC drivers (including Simba Spark)
  • Library paths

Installation

Option 1: CMake FetchContent (Recommended - Direct from GitHub)

Add to your CMakeLists.txt:

include(FetchContent)

FetchContent_Declare(
  databricks_sdk
  GIT_REPOSITORY https://github.com/calvinjmin/databricks-sdk-cpp.git
  GIT_TAG main  # track the latest commit, or pin a specific version tag such as 0.1.0
)

FetchContent_MakeAvailable(databricks_sdk)
target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)

Advantages: no separate installation step, and you always get the exact version you specify.
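
For reference, a complete minimal consumer CMakeLists.txt might look like the sketch below; my_app and main.cpp are placeholder names for your own target and sources.

cmake_minimum_required(VERSION 3.14)
project(my_app LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)

include(FetchContent)
FetchContent_Declare(
  databricks_sdk
  GIT_REPOSITORY https://github.com/calvinjmin/databricks-sdk-cpp.git
  GIT_TAG main
)
FetchContent_MakeAvailable(databricks_sdk)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)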

Option 2: vcpkg

Once published to vcpkg (submission in progress), install with:

vcpkg install databricks-sdk-cpp

Then use in your CMake project:

find_package(databricks_sdk CONFIG REQUIRED)
target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)

For maintainers: See dev-docs/VCPKG_SUBMISSION.md for the complete submission guide.

Option 3: Manual Build and Install

# Clone and build
git clone https://github.com/calvinjmin/databricks-sdk-cpp.git
cd databricks-sdk-cpp
mkdir build && cd build
cmake ..
cmake --build .

# Install (requires sudo on Linux/macOS)
sudo cmake --install .

Then use in your project:

find_package(databricks_sdk REQUIRED)
target_link_libraries(my_app PRIVATE databricks_sdk::databricks_sdk)

Building from Source

# Create build directory
mkdir build && cd build

# Configure
cmake ..

# Build
cmake --build .

# Install (optional)
sudo cmake --install .

Build Options

  • BUILD_EXAMPLES (default: ON) - Build example applications
  • BUILD_TESTS (default: OFF) - Build unit tests
  • BUILD_SHARED_LIBS (default: ON) - Build as shared library

Example:

cmake -DBUILD_EXAMPLES=ON -DBUILD_TESTS=ON ..

Quick Start

Configuration

The SDK uses a modular configuration system that separates authentication, SQL settings, connection pooling, and retry behavior. The Builder pattern provides a clean API for constructing clients.

Configuration Structure

The SDK separates configuration into four distinct concerns:

  • AuthConfig: Core authentication (host, token, timeout) - shared across all Databricks features
  • SQLConfig: SQL-specific settings (http_path, ODBC driver name)
  • PoolingConfig: Optional connection pooling settings (enabled, min/max connections)
  • RetryConfig: Optional automatic retry settings (enabled, max attempts, backoff strategy)

This modular design allows you to:

  • Share AuthConfig across different Databricks service clients (SQL, Workspace, Delta, etc.)
  • Configure only what you need
  • Mix automatic and explicit configuration
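
As an illustration of this separation, all four structs can be attached to a single Builder chain. This is a minimal sketch reusing the configuration types and Builder methods shown throughout this README; the host, token, and warehouse path are placeholders.

databricks::AuthConfig auth;
auth.host = "https://my-workspace.databricks.com";
auth.token = "dapi1234567890abcdef";

databricks::SQLConfig sql;
sql.http_path = "/sql/1.0/warehouses/abc123";

databricks::PoolingConfig pooling;
pooling.enabled = true;
pooling.max_connections = 10;

databricks::RetryConfig retry;
retry.max_attempts = 3;

// Each concern is configured independently and combined at build time
auto client = databricks::Client::Builder()
    .with_auth(auth)
    .with_sql(sql)
    .with_pooling(pooling)
    .with_retry(retry)
    .build();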

Option 1: Automatic Configuration (Recommended)

The SDK automatically loads configuration from ~/.databrickscfg or environment variables:

#include <databricks/client.h>

int main() {
    // Load from ~/.databrickscfg or environment variables
    auto client = databricks::Client::Builder()
        .with_environment_config()
        .build();
    
    auto results = client.query("SELECT * FROM my_table LIMIT 10");
    
    return 0;
}

Configuration Precedence (highest to lowest):

  1. Profile file (~/.databrickscfg with [DEFAULT] section) - if complete, used exclusively
  2. Environment variables (DATABRICKS_HOST, DATABRICKS_TOKEN, DATABRICKS_HTTP_PATH) - only as fallback

Option 2: Profile File

Create ~/.databrickscfg:

[DEFAULT]
host = https://my-workspace.databricks.com
token = dapi1234567890abcdef
http_path = /sql/1.0/warehouses/abc123
# Alternative key name also supported:
# sql_http_path = /sql/1.0/warehouses/abc123

[production]
host = https://prod.databricks.com
token = dapi_prod_token
http_path = /sql/1.0/warehouses/prod123

Load specific profile:

auto client = databricks::Client::Builder()
    .with_environment_config("production")
    .build();

Option 3: Environment Variables Only

export DATABRICKS_HOST="https://my-workspace.databricks.com"
export DATABRICKS_TOKEN="dapi1234567890abcdef"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/abc123"
export DATABRICKS_TIMEOUT=120  # Optional

# Alternative variable names also supported:
# DATABRICKS_SERVER_HOSTNAME, DATABRICKS_ACCESS_TOKEN, DATABRICKS_SQL_HTTP_PATH

Option 4: Manual Configuration

#include <databricks/client.h>

int main() {
    // Configure authentication
    databricks::AuthConfig auth;
    auth.host = "https://my-workspace.databricks.com";
    auth.token = "dapi1234567890abcdef";
    auth.timeout_seconds = 60;

    // Configure SQL settings
    databricks::SQLConfig sql;
    sql.http_path = "/sql/1.0/warehouses/abc123";
    sql.odbc_driver_name = "Simba Spark ODBC Driver";

    // Build client
    auto client = databricks::Client::Builder()
        .with_auth(auth)
        .with_sql(sql)
        .build();

    // Execute a query
    auto results = client.query("SELECT * FROM my_table LIMIT 10");

    return 0;
}

Async Connection (Non-blocking)

#include <databricks/client.h>

int main() {
    // Build client without auto-connecting
    auto client = databricks::Client::Builder()
        .with_environment_config()
        .with_auto_connect(false)
        .build();

    // Start connection asynchronously
    auto connect_future = client.connect_async();

    // Do other work while connecting...

    // Wait for connection before querying
    connect_future.wait();
    auto results = client.query("SELECT current_timestamp()");

    return 0;
}

Connection Pooling (High Performance)

#include <databricks/client.h>

int main() {
    // Configure pooling
    databricks::PoolingConfig pooling;
    pooling.enabled = true;
    pooling.min_connections = 2;
    pooling.max_connections = 10;

    // Build client with pooling
    auto client = databricks::Client::Builder()
        .with_environment_config()
        .with_pooling(pooling)
        .build();

    // Query as usual - connections acquired/released automatically
    auto results = client.query("SELECT * FROM my_table");

    return 0;
}

Note: Multiple Clients with the same config automatically share the same pool!

Automatic Retry Logic (Reliability)

The SDK includes automatic retry logic with exponential backoff for transient failures:

#include <databricks/client.h>

int main() {
    // Configure retry behavior
    databricks::RetryConfig retry;
    retry.enabled = true;                 // Enable retries (default: true)
    retry.max_attempts = 5;               // Retry up to 5 times (default: 3)
    retry.initial_backoff_ms = 200;       // Start with 200ms delay (default: 100ms)
    retry.backoff_multiplier = 2.0;       // Double delay each retry (default: 2.0)
    retry.max_backoff_ms = 10000;         // Cap at 10 seconds (default: 10000ms)
    retry.retry_on_timeout = true;        // Retry timeout errors (default: true)
    retry.retry_on_connection_lost = true;// Retry connection errors (default: true)

    // Build client with retry configuration
    auto client = databricks::Client::Builder()
        .with_environment_config()
        .with_retry(retry)
        .build();

    // Queries automatically retry on transient errors
    auto results = client.query("SELECT * FROM my_table");

    return 0;
}

Retry Features:

  • Exponential backoff with jitter to prevent thundering herd
  • Intelligent error classification - only retries transient errors:
    • Connection timeouts and network errors
    • Server unavailability (503, 502, 504)
    • Rate limiting (429 Too Many Requests)
  • Non-retryable errors fail immediately:
    • Authentication failures
    • SQL syntax errors
    • Permission denied errors
  • Enabled by default with sensible defaults
  • Works with connection pooling for maximum reliability

Disable Retries (if needed):

databricks::RetryConfig no_retry;
no_retry.enabled = false;

auto client = databricks::Client::Builder()
    .with_environment_config()
    .with_retry(no_retry)
    .build();

Mixing Configuration Approaches

The Builder pattern allows you to mix automatic and explicit configuration:

// Load auth from environment, but customize pooling
databricks::PoolingConfig pooling;
pooling.enabled = true;
pooling.max_connections = 20;

auto client = databricks::Client::Builder()
    .with_environment_config()  // Load auth + SQL from environment
    .with_pooling(pooling)       // Override pooling settings
    .build();

Or load auth separately from SQL settings:

// Load auth from profile, SQL from environment
databricks::AuthConfig auth = databricks::AuthConfig::from_profile("production");

databricks::SQLConfig sql;
const char* custom_path = std::getenv("CUSTOM_HTTP_PATH");  // may be nullptr if the variable is unset
if (custom_path != nullptr) {
    sql.http_path = custom_path;
}

auto client = databricks::Client::Builder()
    .with_auth(auth)
    .with_sql(sql)
    .build();

Accessing Configuration

You can access the modular configuration from any client:

auto client = databricks::Client::Builder()
    .with_environment_config()
    .build();

// Access configuration
const auto& auth = client.get_auth_config();
const auto& sql = client.get_sql_config();
const auto& pooling = client.get_pooling_config();

std::cout << "Connected to: " << auth.host << std::endl;
std::cout << "Using warehouse: " << sql.http_path << std::endl;

For a complete example, see examples/simple_query.cpp.

Running Examples

Setup Configuration

Examples automatically load configuration from either:

Option A: Profile File (recommended for development)

Create ~/.databrickscfg:

[DEFAULT]
host = https://your-workspace.databricks.com
token = your_databricks_token
http_path = /sql/1.0/warehouses/your_warehouse_id
# or: sql_http_path = /sql/1.0/warehouses/your_warehouse_id

Option B: Environment Variables (recommended for CI/CD)

export DATABRICKS_HOST="https://your-workspace.databricks.com"
export DATABRICKS_TOKEN="your_databricks_token"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/your_warehouse_id"

Or source a .env file:

set -a; source .env; set +a

Note: Profile configuration takes priority. Environment variables are used only as a fallback when no profile is configured.

Run Examples

After building with BUILD_EXAMPLES=ON, the following examples are available:

# SQL query execution with parameterized queries
./build/examples/simple_query

# Jobs API - list jobs, get details, trigger runs
./build/examples/jobs_example

# Compute API - manage clusters, create/start/stop/terminate
./build/examples/compute_example

Each example demonstrates a different aspect of the SDK:

  • simple_query: Basic SQL execution and parameterized queries
  • jobs_example: Jobs API for workflow automation
  • compute_example: Compute/Clusters API for cluster management

Performance Considerations

Connection Pooling Benefits

Connection pooling eliminates the overhead of creating new ODBC connections for each query:

  • Without pooling: 500-2000ms per query (includes connection time)
  • With pooling: 1-50ms per query (connection reused)
  • Recommended: Use pooling for applications making multiple queries

Async Operations Benefits

Async operations reduce perceived latency by performing work in the background:

  • Async connect: Start connecting while doing other initialization
  • Async query: Execute multiple queries concurrently (a sketch follows this list)
  • Combined with pooling: Maximum throughput for concurrent workloads
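
The sketch below illustrates concurrent queries without relying on a dedicated asynchronous query API from the SDK: it simply wraps the documented blocking query() call in std::async, and assumes that a client with pooling enabled can be shared across threads.

#include <databricks/client.h>
#include <future>

int main() {
    // Pooling lets each in-flight query use its own connection
    databricks::PoolingConfig pooling;
    pooling.enabled = true;
    pooling.max_connections = 4;

    auto client = databricks::Client::Builder()
        .with_environment_config()
        .with_pooling(pooling)
        .build();

    // Run two blocking query() calls concurrently on background threads
    auto f1 = std::async(std::launch::async, [&client] {
        return client.query("SELECT COUNT(*) FROM table_a");
    });
    auto f2 = std::async(std::launch::async, [&client] {
        return client.query("SELECT COUNT(*) FROM table_b");
    });

    // Collect both result sets once they complete
    auto results_a = f1.get();
    auto results_b = f2.get();

    return 0;
}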

Best Practices

  1. Enable pooling via PoolingConfig for applications making multiple queries
  2. Use async operations when you can do other work while waiting
  3. Enable retry logic (on by default) for production reliability against transient failures
  4. Combine pooling + retries for maximum reliability and performance
  5. Size pools appropriately: min = typical concurrent load, max = peak load
  6. Share configs: Clients with identical configs automatically share pools
  7. Tune retry settings based on your workload (a tuned example is sketched after this list):
    • High-throughput: Lower max_attempts (2-3) to fail fast
    • Critical operations: Higher max_attempts (5-7) for maximum reliability
    • Rate-limited APIs: Increase initial_backoff_ms and max_backoff_ms
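
As a concrete example of the tuning guidance above, here is a minimal sketch of a fail-fast profile and a patient profile; the exact numbers are illustrative and should be adjusted to your workload.

// Fail-fast profile for high-throughput workloads
databricks::RetryConfig fail_fast;
fail_fast.max_attempts = 2;            // give up quickly
fail_fast.initial_backoff_ms = 100;

// Patient profile for critical or rate-limited operations
databricks::RetryConfig patient;
patient.max_attempts = 6;              // keep retrying longer
patient.initial_backoff_ms = 500;      // start with a longer delay
patient.max_backoff_ms = 30000;        // allow a larger backoff cap

auto client = databricks::Client::Builder()
    .with_environment_config()
    .with_retry(patient)               // pick the profile that matches the workload
    .build();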

Advanced Usage

Jobs API

Interact with Databricks Jobs to automate and orchestrate data workflows:

#include <databricks/jobs.h>
#include <databricks/config.h>

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
    // Load auth configuration
    databricks::AuthConfig auth = databricks::AuthConfig::from_environment();

    // Create Jobs API client
    databricks::Jobs jobs(auth);

    // List all jobs
    auto job_list = jobs.list_jobs(25, 0);
    for (const auto& job : job_list) {
        std::cout << "Job: " << job.name
                  << " (ID: " << job.job_id << ")" << std::endl;
    }

    // Get specific job details
    auto job = jobs.get_job(123456789);
    std::cout << "Created by: " << job.creator_user_name << std::endl;

    // Trigger a job run with parameters
    std::map<std::string, std::string> params;
    params["date"] = "2024-01-01";
    params["environment"] = "production";

    uint64_t run_id = jobs.run_now(123456789, params);
    std::cout << "Started run: " << run_id << std::endl;

    return 0;
}

Key Features:

  • List jobs: Paginated listing with limit/offset support (see the pagination sketch below)
  • Get job details: Retrieve full job configuration and metadata
  • Trigger runs: Start jobs with optional notebook parameters
  • Type-safe IDs: Uses uint64_t to correctly handle large job IDs
  • JSON parsing: Built on nlohmann/json for reliable parsing
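
A pagination loop built on list_jobs(limit, offset) might look like the sketch below; it assumes offset-based paging where a page shorter than the requested limit marks the end, and that the returned container exposes size().

// Page through all jobs, 25 at a time
const size_t page_size = 25;
size_t offset = 0;
while (true) {
    auto page = jobs.list_jobs(page_size, offset);
    for (const auto& job : page) {
        std::cout << job.job_id << ": " << job.name << std::endl;
    }
    if (page.size() < page_size) {
        break;  // short page: no more jobs to fetch
    }
    offset += page_size;
}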

API Compatibility:

  • Uses Jobs API 2.2 for full feature support including pagination
  • Timestamps returned as Unix milliseconds (uint64_t)
  • Automatic error handling with descriptive messages

For a complete example, see examples/jobs_example.cpp.

Compute/Clusters API

Manage Databricks compute clusters programmatically:

#include <databricks/compute/compute.h>
#include <databricks/core/config.h>

#include <iostream>

int main() {
    databricks::AuthConfig auth = databricks::AuthConfig::from_environment();
    databricks::Compute compute(auth);

    // List clusters
    auto clusters = compute.list_compute();
    for (const auto& c : clusters) {
        std::cout << c.cluster_name << " [" << c.state << "]" << std::endl;
    }

    // Lifecycle management
    compute.start_compute("cluster-id");
    compute.restart_compute("cluster-id");
    compute.terminate_compute("cluster-id");

    return 0;
}

Features:

  • List/get cluster details
  • Start, restart, and terminate clusters
  • Cluster state tracking (PENDING, RUNNING, TERMINATED, etc.); a simple polling sketch follows this list
  • Automatic HTTP retry logic with exponential backoff
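
Building on the state values above, a simple way to track cluster startup is to poll list_compute() and print each cluster's state; real code would stop once the target cluster reports RUNNING. This sketch reuses the compute client from the example above, and the cluster ID, attempt count, and polling interval are placeholders.

#include <chrono>
#include <thread>

// Start a cluster, then poll its state a few times
compute.start_compute("cluster-id");
for (int attempt = 0; attempt < 5; ++attempt) {
    for (const auto& c : compute.list_compute()) {
        std::cout << c.cluster_name << " [" << c.state << "]" << std::endl;
    }
    std::this_thread::sleep_for(std::chrono::seconds(10));
}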

HTTP Retry Logic:

All REST API calls automatically retry on transient failures (408, 429, 500-504) with exponential backoff (1s, 2s, 4s). This is built into the HTTP client and requires no configuration.

Direct ConnectionPool Management

For advanced users who need fine-grained control over connection pools:

#include <databricks/connection_pool.h>

// Build config for pool
databricks::AuthConfig auth;
auth.host = "https://my-workspace.databricks.com";
auth.token = "dapi1234567890abcdef";

databricks::SQLConfig sql;
sql.http_path = "/sql/1.0/warehouses/abc123";

// Create and manage pool explicitly
databricks::ConnectionPool pool(auth, sql, 2, 10);
pool.warm_up();

// Acquire connections manually
{
    auto pooled_conn = pool.acquire();
    auto results = pooled_conn->query("SELECT...");
} // Connection returns to pool

// Monitor pool
auto stats = pool.get_stats();
std::cout << "Available: " << stats.available_connections << std::endl;

Note: Most users should use the Builder with PoolingConfig instead of direct pool management.

Documentation

The SDK includes comprehensive API documentation generated from code comments using Doxygen.

📚 View Online Documentation

Live Documentation: https://calvinjmin.github.io/databricks-sdk-cpp/

The documentation is automatically built and published via GitHub Actions whenever changes are pushed to the main branch.

Generate Documentation Locally

# Install Doxygen
brew install doxygen  # macOS
# or: sudo apt-get install doxygen  # Linux

# Generate docs (creates docs/html/)
doxygen Doxyfile

# View in browser
open docs/html/index.html  # macOS
# or: xdg-open docs/html/index.html  # Linux

Documentation Features

The generated documentation includes:

  • Complete API Reference: All public classes, methods, and structs with detailed descriptions
  • README Integration: Full README displayed as the main landing page
  • Code Examples: Inline examples from header comments
  • Jobs API Documentation: Full reference for databricks::Jobs, Job, and JobRun types
  • SQL Client Documentation: Complete databricks::Client API reference
  • Connection Pooling: databricks::ConnectionPool and configuration types
  • Source Browser: Browse source code with syntax highlighting
  • Search Functionality: Quick search across all documentation
  • Cross-references: Navigate between related classes and methods

Quick Links (After Generation)

  • Main Page: docs/html/index.html - README and getting started
  • Classes: docs/html/annotated.html - All classes and structs
  • Jobs API: docs/html/classdatabricks_1_1_jobs.html - Jobs API reference
  • Client API: docs/html/classdatabricks_1_1_client.html - SQL client reference
  • Files: docs/html/files.html - Browse by file

Example: Viewing Jobs API Docs

# Generate and open Jobs API documentation
doxygen Doxyfile
open docs/html/classdatabricks_1_1_jobs.html

The documentation is automatically generated from the inline comments in header files, ensuring it stays synchronized with the code.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

For issues and questions, please open an issue on the GitHub repository.