Add a note explaining why using serverless compute instead of a dedicated cluster can be problematic #50

@nsubordin81

Suggestion for Course Improvement: Cluster Configuration Guidance

I encountered an important issue during the course that could benefit future students if addressed with additional guidance.

While the course already encourages students to set up a single-node cluster, the complications I ran into show that it would help to explain which parts of the course are affected if serverless compute is used instead.

While working through the course, I used both the Azure free trial and Databricks' 14-day free trial. My cluster setup hit an error because the specified compute resource size was unavailable, and I overlooked it in the logs. As a result, I proceeded with the labs using serverless compute instead.

This decision created downstream problems when reaching the section on using Spark for data querying and extraction. The provided notebook, which uses Derar Alhussein's anonymous S3 bucket as a data source, failed to execute properly. I believe this occurred because the configuration settings for S3 resource access in Spark had been renamed or modified in the serverless compute runtime version.
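For illustration, here is a minimal sketch of how a notebook could check whether an S3 access setting can be applied on the current compute. The property name and provider class are the standard Hadoop S3A ones; whether the course notebook uses exactly this setting, and whether serverless compute rejects it, are assumptions on my part. `try_set_conf` is a hypothetical helper, not something from the course materials.

```python
from typing import Any

# Standard Hadoop S3A property for choosing a credentials provider.
# Assumption: the course notebook relies on a setting like this one.
S3A_PROVIDER_KEY = "fs.s3a.aws.credentials.provider"
ANONYMOUS_PROVIDER = "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"


def try_set_conf(spark: Any, key: str, value: str) -> bool:
    """Try to set a Spark conf value (hypothetical helper).

    Returns True if the value was set, False if the runtime raised.
    """
    try:
        spark.conf.set(key, value)
        return True
    except Exception as exc:  # the exception type may differ between runtimes
        print(f"Could not set {key!r} on this compute: {exc}")
        return False
```

In a notebook this would be called as `try_set_conf(spark, S3A_PROVIDER_KEY, ANONYMOUS_PROVIDER)` before reading from the bucket. A `False` result would be an early hint that the notebook is attached to compute that does not accept the setting.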

Additionally, I ran into problems with the section demonstrating global temporary view lifetimes. These views are registered in a separate schema (`global_temp` by default), and in my experience that schema could only be referenced from a dedicated cluster.
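A rough sketch of the global temporary view pattern that section depends on is below. The view name `latest_orders_gv` and source table `orders` are hypothetical placeholders, not names from the course, and `run_global_temp_view_demo` is an illustrative helper. `global_temp` is Spark's default schema for global temporary views.

```python
from typing import Any, List

# Spark's default schema for global temporary views.
GLOBAL_TEMP_SCHEMA = "global_temp"


def global_temp_view_sql(view_name: str, source_table: str) -> List[str]:
    """Build the two statements: create the view, then query it via its schema."""
    return [
        f"CREATE OR REPLACE GLOBAL TEMP VIEW {view_name} "
        f"AS SELECT * FROM {source_table}",
        f"SELECT * FROM {GLOBAL_TEMP_SCHEMA}.{view_name}",
    ]


def run_global_temp_view_demo(spark: Any, view_name: str, source_table: str) -> bool:
    """Run both statements; return False if any of them raises."""
    try:
        for statement in global_temp_view_sql(view_name, source_table):
            spark.sql(statement)
        return True
    except Exception as exc:
        print(f"Global temp view demo failed on this compute: {exc}")
        return False
```

Calling `run_global_temp_view_demo(spark, "latest_orders_gv", "orders")` on the configured cluster and again on serverless compute would be one way to show students the difference directly.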

Recommendation

While the course already instructs students to use their configured cluster for exercises, I recommend adding a specific note explaining:

  1. The importance of using the exact cluster configuration specified in the instructions
  2. Potential compatibility issues when using serverless compute instead of dedicated clusters
  3. How runtime versions may affect the S3 connection parameters in the Spark configuration

This additional guidance would help students quickly identify and resolve configuration-related issues, particularly when working with external data sources and schema-dependent operations.
