
ADB workspace with external hive metastore

Credits to [email protected] and [email protected] for the notebook logic behind the database initialization steps. The module deploys the architecture described under "Module creates" below.

Get Started:

On your local machine, inside the adb-external-hive-metastore folder:

  1. Clone the tf_azure_deployment repository to your local machine.

  2. Supply your own terraform.tfvars file to override the default values as needed. See the Inputs section below for optional and required variables; a sample tfvars sketch follows this list.

  3. For step 2, the db_username and db_password variables can also be supplied as environment variables: Terraform automatically picks up environment variables named TF_VAR_<variable_name>.

    export TF_VAR_db_username=yoursqlserveradminuser

    export TF_VAR_db_password=yoursqlserveradminpassword

  4. Initialize Terraform and apply to deploy the resources:

    terraform init

    terraform apply
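A minimal terraform.tfvars sketch for step 2, using only variable names from the Inputs table below (values are illustrative; omit anything you are happy to leave at its default):

```hcl
# terraform.tfvars -- illustrative overrides only
rglocation       = "southeastasia"
workspace_prefix = "adb"
spokecidr        = "10.179.0.0/20"
sqlvnetcidr      = "10.178.0.0/20"
cold_start       = true
# db_username / db_password are supplied via TF_VAR_* environment variables (step 3)
```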

Step 4 automatically completes 99% of the steps. The remaining 1% is to manually trigger the deployed job to run once.

Go to the Databricks workspace > Jobs and run the auto-deployed job exactly once; this initializes the database with the metastore schema.
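If you prefer the command line over the UI, the same one-off run can be triggered with the Databricks CLI; this is a sketch assuming the CLI is installed and configured against the new workspace:

```sh
# look up the id of the auto-deployed metastore setup job, then run it once
databricks jobs list
databricks jobs run-now --job-id <job-id>
```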


Then you can verify in a notebook:

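For example, a minimal check from a notebook attached to the cluster (the table name below is purely illustrative):

```python
# Databases and tables listed here are now resolved against the external Hive metastore
spark.sql("SHOW DATABASES").show()

# Create a test table, then confirm it shows up
spark.sql("CREATE TABLE IF NOT EXISTS hive_metastore_test (id INT, name STRING)")
spark.sql("SHOW TABLES").show()
```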

We can also check inside the SQL database (the metastore) that the cluster is successfully linked to the external Hive metastore and that the table has been registered there:

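One way to confirm this is to query the Hive metastore schema tables directly from the Azure portal query editor or any SQL client; this sketch assumes the standard Hive schema table names DBS and TBLS:

```sql
-- databases registered in the external Hive metastore
SELECT DB_ID, NAME, DB_LOCATION_URI FROM dbo.DBS;

-- tables registered in the external Hive metastore
SELECT TBL_ID, DB_ID, TBL_NAME, TBL_TYPE FROM dbo.TBLS;
```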

Now you can configure all other clusters to use this external metastore by reusing the same Spark config and environment variables as the cold-start cluster.
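As a sketch, that cluster Spark config typically looks like the following; the secret scope and key names assume the key vault-backed scope and the hiveurl/hiveuser/hivepwd secrets created by this module, and the Hive version and jars path depend on what the cold-start cluster downloaded:

```properties
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionURL {{secrets/<scope-name>/hiveurl}}
spark.hadoop.javax.jdo.option.ConnectionUserName {{secrets/<scope-name>/hiveuser}}
spark.hadoop.javax.jdo.option.ConnectionPassword {{secrets/<scope-name>/hivepwd}}
spark.sql.hive.metastore.version <hive-version>
spark.sql.hive.metastore.jars /dbfs/<path-to-downloaded-hive-jars>/*
```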

Notes: migrating from your existing managed metastore to the external metastore

Refer to this tutorial: https://kb.databricks.com/metastore/create-table-ddl-for-metastore.html

# Export the CREATE TABLE DDL for every table in every database,
# writing one .ddl file per database so the statements can be replayed
# against the new (external) metastore.
dbs = spark.catalog.listDatabases()
for db in dbs:
    with open("your_file_name_{}.ddl".format(db.name), "w") as f:
        tables = spark.catalog.listTables(db.name)
        for t in tables:
            DDL = spark.sql("SHOW CREATE TABLE {}.{}".format(db.name, t.name))
            f.write(DDL.first()[0])
            f.write("\n")

Module creates:

  • Resource group with random prefix
  • Tags, including Owner, which is taken from az account show --query user
  • VNet with public and private subnets
  • Databricks workspace
  • External Hive Metastore for ADB workspace
  • Private endpoint connection to external metastore

Requirements

| Name | Version |
|------|---------|
| azurerm | =2.83.0 |
| databricks | 0.3.10 |
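These pins correspond to a required_providers block roughly like the one below; the provider source addresses, in particular databrickslabs/databricks for the 0.3.x series, are an assumption here:

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "=2.83.0"
    }
    databricks = {
      source  = "databrickslabs/databricks" # assumed source address for 0.3.x
      version = "0.3.10"
    }
  }
}
```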

Providers

| Name | Version |
|------|---------|
| azurerm | 2.83.0 |
| databricks | 0.3.10 |
| external | 2.1.0 |
| random | 3.1.0 |

Modules

No modules.

Resources

| Name | Type |
|------|------|
| azurerm_databricks_workspace.this | resource |
| azurerm_key_vault.akv1 | resource |
| azurerm_key_vault_access_policy.example | resource |
| azurerm_key_vault_secret.hivepwd | resource |
| azurerm_key_vault_secret.hiveurl | resource |
| azurerm_key_vault_secret.hiveuser | resource |
| azurerm_mssql_database.sqlmetastore | resource |
| azurerm_mssql_server.metastoreserver | resource |
| azurerm_mssql_server_extended_auditing_policy.mssqlpolicy | resource |
| azurerm_mssql_virtual_network_rule.sqlservervnetrule | resource |
| azurerm_network_security_group.this | resource |
| azurerm_private_dns_zone.dnsmetastore | resource |
| azurerm_private_dns_zone_virtual_network_link.metastorednszonevnetlink | resource |
| azurerm_private_endpoint.sqlserverpe | resource |
| azurerm_resource_group.this | resource |
| azurerm_storage_account.sqlserversa | resource |
| azurerm_subnet.plsubnet | resource |
| azurerm_subnet.private | resource |
| azurerm_subnet.public | resource |
| azurerm_subnet.sqlsubnet | resource |
| azurerm_subnet_network_security_group_association.private | resource |
| azurerm_subnet_network_security_group_association.public | resource |
| azurerm_virtual_network.sqlvnet | resource |
| azurerm_virtual_network.this | resource |
| databricks_cluster.coldstart | resource |
| databricks_global_init_script.metastoreinit | resource |
| databricks_job.metastoresetup | resource |
| databricks_notebook.ddl | resource |
| databricks_secret_scope.kv | resource |
| random_string.naming | resource |
| azurerm_client_config.current | data source |
| databricks_current_user.me | data source |
| databricks_node_type.smallest | data source |
| databricks_spark_version.latest_lts | data source |
| external_external.me | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| cold_start | if true, will spin up a cluster to download hive jars to dbfs | bool | true | no |
| db_password | Database administrator password | string | n/a | yes |
| db_username | Database administrator username | string | n/a | yes |
| dbfs_prefix | n/a | string | "dbfs" | no |
| no_public_ip | n/a | bool | true | no |
| private_subnet_endpoints | n/a | list | [] | no |
| rglocation | n/a | string | "southeastasia" | no |
| spokecidr | n/a | string | "10.179.0.0/20" | no |
| sqlvnetcidr | n/a | string | "10.178.0.0/20" | no |
| workspace_prefix | n/a | string | "adb" | no |

Outputs

| Name | Description |
|------|-------------|
| arm_client_id | n/a |
| arm_subscription_id | n/a |
| arm_tenant_id | n/a |
| azure_region | n/a |
| databricks_azure_workspace_resource_id | n/a |
| resource_group | n/a |
| workspace_url | n/a |
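Once terraform apply has finished, any of these can be read back from the state, for example:

```sh
# print the deployed workspace URL (use -raw to strip the surrounding quotes)
terraform output workspace_url
```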