
Conversation

amotl (Member) commented on Sep 14, 2025


coderabbitai bot commented Sep 14, 2025

Walkthrough

Adds documentation pages for integrating CrateDB with Azure Databricks and Apache Spark: a Databricks index page and an Azure Databricks tutorial, plus a Spark index page and a Spark usage guide with JDBC examples in Python and Scala. No code or public APIs were changed.

Changes

  • Databricks integration docs (docs/integrate/databricks/index.md, docs/integrate/databricks/azure-databricks.md): Introduces a Databricks section and an Azure Databricks tutorial covering workspace and cluster provisioning, installing the PostgreSQL JDBC driver (org.postgresql:postgresql:42.2.23), and Python/Scala notebook JDBC examples to connect to CrateDB.
  • Spark integration docs (docs/integrate/spark/index.md, docs/integrate/spark/usage.md): Adds a Spark topic page and a usage guide describing prerequisites, single-node Spark setup (macOS), JDBC configuration, DataFrame read/transform/write examples (PySpark), and verification steps.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor U as User
  participant D as Azure Databricks Workspace
  participant S as Spark Cluster
  participant C as CrateDB
  note over D,S: JDBC driver installed (org.postgresql:postgresql:42.2.23)

  U->>D: Open notebook (Python/Scala)
  D->>S: Submit Spark job with JDBC options (url, driver, dbtable, user, password)
  S->>C: JDBC connect / SELECT
  C-->>S: Result set
  S-->>D: DataFrame results (e.g., head)
  U->>D: Save results via JDBC (mode=overwrite)
  D->>S: Write job -> JDBC INSERT/UPSERT
  C-->>S: Acknowledgment
```
```mermaid
sequenceDiagram
  autonumber
  actor U as User
  participant SP as Local Spark (PySpark)
  participant C as CrateDB

  U->>SP: Start PySpark with JDBC jar on classpath
  SP->>C: JDBC read (url, table)
  C-->>SP: Rows
  SP->>SP: Transform (filter, aggregate)
  SP->>C: JDBC write (mode=overwrite)
  C-->>SP: Ack
```
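For orientation, the local flow in the second diagram maps onto PySpark roughly as follows. This is a minimal sketch, not code from the PR: the connection URL, credentials, jar path, filter threshold, and the target table `machine_averages` are placeholders; only the `sensor_data` table name and the driver jar version come from the guides under review.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local session with the PostgreSQL JDBC driver on the classpath
# (jar filename follows the guide; the path is a placeholder).
spark = (
    SparkSession.builder
    .appName("cratedb-spark-sketch")
    .config("spark.jars", "postgresql-42.6.0.jar")
    .getOrCreate()
)

jdbc_url = "jdbc:postgresql://<url-to-server>:5432/?sslmode=require"

# Read the sensor_data table from CrateDB over JDBC.
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "sensor_data")
    .option("user", "<username>")
    .option("password", "<password>")
    .load()
)

# Transform: filter and aggregate, as in the diagram (illustrative columns).
grouped_df = (
    df.filter(F.col("value") > 50)
    .groupBy("machine")
    .agg(F.avg("value").alias("avg_value"))
)

# Write the result back to CrateDB (hypothetical target table).
(
    grouped_df.write.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "machine_averages")
    .option("user", "<username>")
    .option("password", "<password>")
    .mode("overwrite")
    .save()
)

spark.stop()
```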

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested labels

new content, guidance

Suggested reviewers

  • hammerhead
  • kneth
  • surister

Poem

I hopped through docs with brisk delight,
Linked Spark and Databricks day and night.
JDBC strings tied nice and neat,
CrateDB and clusters now can meet.
🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Title Check: ✅ Passed. The title clearly encapsulates the primary change by indicating the addition of basic usage guides for both Databricks and Spark integrations, making it concise, specific, and reflective of the pull request’s main purpose.
  • Description Check: ✅ Passed. The description outlines that the pull request adds integration guides from the community forum and provides preview links for the newly added Databricks and Spark documentation, directly aligning with the changes in the pull request.
  • Docstring Coverage: ✅ Passed. No functions found in the changes, so the docstring coverage check was skipped.


Comment on lines +37 to +38
:::{rubric} Learn
:::
amotl (Member Author):
@coderabbitai: Please carry this over into a dedicated issue.


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue for adding the two usage guides to Spark documentation

gh issue create \
  --title "Add additional Spark usage guides to documentation" \
  --body "## Context

Two additional usage guides for Apache Spark with CrateDB should be added to the Spark integration documentation page.

## Usage Guides to Add

1. Machine Learning on the Edge with Apache Spark: https://github.com/crate/crate-ml/blob/master/docs/ml_on_the_edge/apache_spark/Readme.md (by @patrickFuerst)
2. PySpark Write DataFrame to CrateDB via JDBC: https://github.com/crate/pre-sales/blob/master/Demos/PySpark/pyspark_write_data_frame_to_CrateDB_jdbc.py (by @ckurze)

## Target File

\`docs/integrate/spark/index.md\`

## References

- Pull Request: https://github.com/crate/cratedb-guide/pull/281
- Comment: https://github.com/crate/cratedb-guide/pull/281#discussion_r2388862038
- Requested by: @amotl" \
  --assignee amotl

Length of output: 49


@amotl, I've created issue #352 to track adding those two Spark usage guides to the documentation. The issue includes links to both guides and references back to this PR and comment.

amotl changed the title from "Databricks and Spark: Basic connectivity tutorials" to "Databricks and Spark: Basic usage guides" on Sep 30, 2025
amotl marked this pull request as ready for review on September 30, 2025 at 16:49
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
docs/integrate/spark/usage.md (2)

8-12: Consider adding Java to the Prerequisites section.

Since the setup instructions (line 44) require Java (and Scala) to run Apache Spark, it would be clearer to mention Java as a prerequisite upfront rather than only in the setup section.

Apply this diff:

 Prerequisites:
 
 1. Running instance of [CrateDB](https://console.cratedb.cloud/)
 2. Python 3.x
+3. Java 11 or later

68-68: Align PostgreSQL JDBC driver versions across documentation.

Currently, spark/usage.md references 42.6.0, databricks/azure-databricks.md uses 42.2.23, and tableau/index.md mentions 42.7.1. Choose a single, up-to-date version (for example, the latest stable 42.7.x) or document why different contexts require different drivers.

docs/integrate/databricks/azure-databricks.md (2)

21-36: Add language identifier to Python code block.

The code block is missing a language identifier, which helps with syntax highlighting and accessibility.

Change the code fence from ``` to ```python.

-```
+```python
 crateUsername = "<username>"
 cratePassword = "<password>"
 postgresqlUrl = "jdbc:postgresql://<url-to-server>:5432/?sslmode=require";

46-62: Add language identifier to Scala code block.

The code block is missing a language identifier. Also note that the Scala example includes a useful fetchsize option (line 59) that could be beneficial to mention in the documentation text or in the Python example; a sketch of that follows below.

Change the code fence from ``` to ```scala.

-```
+```scala
 val crateUsername = "<username>"
 val cratePassword = "<password>"
 val postgresqlUrl = "jdbc:postgresql://<url-to-server>:5432/?sslmode=require";

Note: The static analysis tool flagged line 50 as a "bare URL," but this is a false positive—it's correctly used within a string assignment.
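If the fetchsize hint gets carried into the Python notebook example as suggested above, a hedged sketch could look like the following. It assumes the `spark` session that Databricks notebooks provide; the host, credentials, table name, and the fetch size value are illustrative placeholders, not values from the tutorial.

```python
# Sketch only: the Python notebook JDBC read with the fetchsize option
# carried over from the Scala example. All connection values are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<url-to-server>:5432/?sslmode=require")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "<table>")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("fetchsize", "1000")  # rows fetched per round trip; tune for the workload
    .load()
)

df.head()
```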

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e89b372 and eeb50f9.

📒 Files selected for processing (4)
  • docs/integrate/databricks/azure-databricks.md (1 hunks)
  • docs/integrate/databricks/index.md (1 hunks)
  • docs/integrate/spark/index.md (1 hunks)
  • docs/integrate/spark/usage.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/integrate/databricks/azure-databricks.md

50-50: Bare URL used

(MD034, no-bare-urls)

docs/integrate/spark/index.md

50-50: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (3)
docs/integrate/databricks/index.md (1)

1-53: LGTM! Well-structured Databricks integration index.

The documentation index is well-organized with clear sections (About, Related, Learn) and proper cross-references to Apache Spark and the Azure Databricks tutorial.

docs/integrate/spark/index.md (1)

1-65: LGTM! Well-structured Spark integration index.

The documentation index is well-organized with clear sections and proper cross-references to Databricks and usage guides. The structure mirrors the Databricks index page nicely for consistency.

Note: The static analysis tool flagged line 50 as a "bare URL," but this is a false positive—the URL is correctly used within a grid-item-card's :link: property, not as bare text.

docs/integrate/spark/usage.md (1)

79-120: LGTM! Clear and correct PySpark examples.

The code examples demonstrate proper JDBC connectivity with CrateDB, including:

  • Correct driver configuration and classpath setup
  • Proper DataFrame operations (filtering, aggregation)
  • Appropriate write mode (overwrite) for the use case
  • Good practice of stopping the Spark session

The explanatory text is clear and helpful.

Comment on lines +26 to +39
```sql
INSERT INTO sensor_data (id, value, machine)
SELECT
    id,
    floor(random()*100),
    CASE floor(random() * 3)
        WHEN 0 THEN 'machine 1'
        WHEN 1 THEN 'machine 2'
        WHEN 2 THEN 'machine 3'
        ELSE 'machine 4'
    END AS machine
FROM
    generate_series(1, 100) id;
```

⚠️ Potential issue | 🟡 Minor

Unreachable ELSE clause in CASE statement.

Line 30: floor(random() * 3) generates only values 0, 1, or 2, so the ELSE clause assigning 'machine 4' (line 34) will never execute.

If three machines are intended, remove the ELSE clause:

     CASE floor(random() * 3)
         WHEN 0 THEN 'machine 1'
         WHEN 1 THEN 'machine 2'
         WHEN 2 THEN 'machine 3'
-        ELSE 'machine 4'
    END AS machine

Or if four machines are intended, use floor(random() * 4):

-    CASE floor(random() * 3)
+    CASE floor(random() * 4)
         WHEN 0 THEN 'machine 1'
         WHEN 1 THEN 'machine 2'
         WHEN 2 THEN 'machine 3'
         ELSE 'machine 4'
    END AS machine
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

     INSERT INTO sensor_data (id, value, machine)
     SELECT
         id,
         floor(random()*100),
-        CASE floor(random() * 3)
+        CASE floor(random() * 4)
             WHEN 0 THEN 'machine 1'
             WHEN 1 THEN 'machine 2'
             WHEN 2 THEN 'machine 3'
             ELSE 'machine 4'
         END AS machine
     FROM
         generate_series(1, 100) id;
🤖 Prompt for AI Agents
In docs/integrate/spark/usage.md around lines 26 to 38 the CASE ELSE branch is
unreachable because floor(random() * 3) yields only 0, 1, or 2; either remove
the ELSE and keep three WHEN branches for 'machine 1'..'machine 3' if you intend
three machines, or change the random multiplier to 4 (floor(random() * 4)) and
add a WHEN 3 THEN 'machine 4' branch so all four machine labels can be produced.

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/integrate/spark/usage.md (1)

26-39: Unreachable ELSE clause in CASE statement.

The CASE statement has a logic error: floor(random() * 3) generates only values 0, 1, or 2, so the ELSE clause assigning 'machine 4' (line 35) will never execute.

Choose one of these fixes:

Option 1: Keep three machines (remove ELSE):

     CASE floor(random() * 3)
         WHEN 0 THEN 'machine 1'
         WHEN 1 THEN 'machine 2'
         WHEN 2 THEN 'machine 3'
-        ELSE 'machine 4'
    END AS machine

Option 2: Support four machines (change multiplier):

-    CASE floor(random() * 3)
+    CASE floor(random() * 4)
         WHEN 0 THEN 'machine 1'
         WHEN 1 THEN 'machine 2'
         WHEN 2 THEN 'machine 3'
         ELSE 'machine 4'
    END AS machine
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eeb50f9 and 2d2b530.

📒 Files selected for processing (4)
  • docs/integrate/databricks/azure-databricks.md (1 hunks)
  • docs/integrate/databricks/index.md (1 hunks)
  • docs/integrate/spark/index.md (1 hunks)
  • docs/integrate/spark/usage.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/integrate/databricks/azure-databricks.md
  • docs/integrate/databricks/index.md
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/integrate/spark/index.md

50-50: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build docs
🔇 Additional comments (14)
docs/integrate/spark/index.md (6)

1-3: LGTM!

The dual label approach (spark)= and (apache-spark)= provides flexible cross-referencing options throughout the documentation.


5-24: LGTM!

The About section provides a clear, accurate description of Apache Spark's capabilities, covering distributed computing, multi-language support, and various data processing use cases.


26-32: LGTM!

The Related section correctly links to Databricks documentation, which is appropriate since Databricks is built on Apache Spark.


37-55: LGTM!

The Learn section provides helpful navigation to both the usage guide and practical examples. The grid-item-cards use appropriate link types (ref for internal, url for external).

Note: The static analysis warning about a "bare URL" at line 50 is a false positive—the URL is properly structured within the :link: directive, not bare text.


58-62: LGTM!

The toctree configuration is appropriate, with :hidden: keeping the navigation clean while still making the Usage page discoverable.


65-65: LGTM!

The external link reference correctly points to the official Apache Spark website.

docs/integrate/spark/usage.md (8)

1-13: LGTM!

The introduction clearly explains the value proposition of combining Apache Spark with CrateDB, and the prerequisites are accurate and complete.


18-24: LGTM!

The table schema is appropriate for the tutorial, with IF NOT EXISTS ensuring idempotent execution.


41-68: LGTM!

The setup instructions are comprehensive and include verification steps. The JAVA_HOME configuration is correctly documented for macOS users.


81-90: LGTM!

The PySpark JDBC configuration is correct for connecting to CrateDB via the PostgreSQL protocol. The code properly includes the JDBC driver JAR and uses appropriate connection options.


100-103: LGTM!

The DataFrame operations demonstrate appropriate PySpark patterns for filtering and aggregation that align with the sensor_data schema.


110-116: Verify the write mode syntax.

Line 115 uses .save(mode="overwrite"), but the standard PySpark syntax is .mode("overwrite").save(). Please verify this works as written or adjust to the more common pattern.

# Standard pattern:
grouped_df.write.format("jdbc") \
    .option(...) \
    .mode("overwrite") \
    .save()

# Or:
grouped_df.write.format("jdbc") \
    .option(...) \
    .save(mode="overwrite")  # Both may work, but .mode() is more common
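Both spellings should be accepted, since PySpark's DataFrameWriter.save() also takes an optional mode keyword, but the chained form reads more idiomatically. A complete hedged sketch of the chained form, assuming `grouped_df` is the aggregated DataFrame from the guide and using placeholder connection values and a hypothetical target table:

```python
# Chained DataFrameWriter pattern: set the save mode explicitly, then save.
# URL, credentials, and target table name are placeholders, not values from the guide.
(
    grouped_df.write.format("jdbc")
    .option("url", "jdbc:postgresql://<url-to-server>:5432/?sslmode=require")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "machine_averages")
    .option("user", "<username>")
    .option("password", "<password>")
    .mode("overwrite")
    .save()
)
```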

105-108: LGTM!

The explanatory text provides clear guidance on writing results back to CrateDB and includes a helpful verification step with a SELECT query.

Also applies to: 118-128


134-143: LGTM!

The wrap-up section effectively summarizes the tutorial and provides relevant next steps with links to CrateDB Cloud and community resources.

Comment on lines +69 to +70
4. Finally, as CrateDB communicates with Spark via JDBC, download the [Postgres JDBC driver](https://jdbc.postgresql.org/download/) in your working directory.
In this usage guide, we use the `postgresql-42.6.0.jar` driver.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Verify JDBC driver version is current.

The guide references postgresql-42.6.0.jar from what appears to be 2023 content. Ensure this version is still appropriate or consider recommending users download the latest compatible version.


🌐 Web query:

What is the latest version of PostgreSQL JDBC driver compatible with Apache Spark and CrateDB?

💡 Result:

Short answer: Use the latest pgJDBC release (org.postgresql:postgresql v42.7.7 as of the pgjdbc releases page) — it is compatible with Spark (Spark simply requires the JDBC driver on the classpath) and meets CrateDB's requirement of pgjdbc 9.4.1209 or newer. [1][2][3]

Sources:

  • CrateDB JDBC compatibility: pgjdbc 9.4.1209 and above are compatible. [1]
  • pgjdbc releases (latest release listing shows v42.7.7). [2]
  • Apache Spark JDBC docs (Spark uses the PostgreSQL JDBC jar on the classpath; no special version restriction). [3]

Recommend using latest PostgreSQL JDBC driver version
docs/integrate/spark/usage.md (lines 69–70): replace the hard-coded postgresql-42.6.0.jar with the most recent release (e.g., postgresql-42.7.7.jar as of Sept 2025) or instruct users to download the latest version available.
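If the guides standardize on a newer driver, one way to avoid hard-coding a downloaded jar at all is to resolve it by Maven coordinates when building the session. A minimal sketch, assuming the 42.7.7 version suggested by the query above:

```python
from pyspark.sql import SparkSession

# Resolve the pinned pgJDBC driver via Maven coordinates instead of a
# hand-downloaded jar file. Bump the version when the docs are updated.
spark = (
    SparkSession.builder
    .appName("cratedb-jdbc")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.7")
    .getOrCreate()
)
```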

amotl (Member Author):

Already tracking this per GH-351?

amotl merged commit 69003c8 into main on Sep 30, 2025
3 checks passed
amotl deleted the spark branch on September 30, 2025 at 18:07