
Conversation

amotl (Member) commented on Sep 14, 2025


coderabbitai bot commented Sep 14, 2025

Walkthrough

Adds documentation pages for integrating CrateDB with Azure Databricks and Apache Spark: a Databricks index page and an Azure Databricks tutorial, plus a Spark index page and a Spark usage guide with JDBC examples in Python and Scala. No code or public APIs were changed.

Changes

  • Databricks integration docs (docs/integrate/databricks/index.md, docs/integrate/databricks/azure-databricks.md): Introduces a Databricks section and an Azure Databricks tutorial covering workspace and cluster provisioning, installing the PostgreSQL JDBC driver (org.postgresql:postgresql:42.2.23), and Python/Scala notebook JDBC examples to connect to CrateDB.
  • Spark integration docs (docs/integrate/spark/index.md, docs/integrate/spark/usage.md): Adds a Spark topic page and a usage guide describing prerequisites, single-node Spark setup (macOS), JDBC configuration, DataFrame read/transform/write examples (PySpark), and verification steps.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor U as User
  participant D as Azure Databricks Workspace
  participant S as Spark Cluster
  participant C as CrateDB
  note over D,S: JDBC driver installed (org.postgresql:postgresql:42.2.23)

  U->>D: Open notebook (Python/Scala)
  D->>S: Submit Spark job with JDBC options (url, driver, dbtable, user, password)
  S->>C: JDBC connect / SELECT
  C-->>S: Result set
  S-->>D: DataFrame results (e.g., head)
  U->>D: Save results via JDBC (mode=overwrite)
  D->>S: Write job -> JDBC INSERT/UPSERT
  C-->>S: Acknowledgment
```
```mermaid
sequenceDiagram
  autonumber
  actor U as User
  participant SP as Local Spark (PySpark)
  participant C as CrateDB

  U->>SP: Start PySpark with JDBC jar on classpath
  SP->>C: JDBC read (url, table)
  C-->>SP: Rows
  SP->>SP: Transform (filter, aggregate)
  SP->>C: JDBC write (mode=overwrite)
  C-->>SP: Ack
```
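For orientation, the local flow in the second diagram maps onto PySpark roughly as follows. This is a minimal sketch, not code from the PR: the connection URL, credentials, jar path, filter threshold, and the target table `machine_averages` are placeholders; only the `sensor_data` table name and the driver jar version come from the guides under review.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local session with the PostgreSQL JDBC driver on the classpath
# (jar filename follows the guide; the path is a placeholder).
spark = (
    SparkSession.builder
    .appName("cratedb-spark-sketch")
    .config("spark.jars", "postgresql-42.6.0.jar")
    .getOrCreate()
)

jdbc_url = "jdbc:postgresql://<url-to-server>:5432/?sslmode=require"

# Read the sensor_data table from CrateDB over JDBC.
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "sensor_data")
    .option("user", "<username>")
    .option("password", "<password>")
    .load()
)

# Transform: filter and aggregate, as in the diagram (illustrative columns).
grouped_df = (
    df.filter(F.col("value") > 50)
    .groupBy("machine")
    .agg(F.avg("value").alias("avg_value"))
)

# Write the result back to CrateDB (hypothetical target table).
(
    grouped_df.write.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "machine_averages")
    .option("user", "<username>")
    .option("password", "<password>")
    .mode("overwrite")
    .save()
)

spark.stop()
```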

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested labels

new content, guidance

Suggested reviewers

  • hammerhead
  • kneth
  • surister

Poem

I hopped through docs with brisk delight,
Linked Spark and Databricks day and night.
JDBC strings tied nice and neat,
CrateDB and clusters now can meet.
🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Title Check: ✅ Passed. The title clearly encapsulates the primary change by indicating the addition of basic usage guides for both Databricks and Spark integrations, making it concise, specific, and reflective of the pull request’s main purpose.
  • Description Check: ✅ Passed. The description outlines that the pull request adds integration guides from the community forum and provides preview links for the newly added Databricks and Spark documentation, directly aligning with the changes in the pull request.
  • Docstring Coverage: ✅ Passed. No functions found in the changes, so the docstring coverage check was skipped.


Comment on lines +37 to +38
:::{rubric} Learn
:::
amotl (Member Author):
@coderabbitai: Please carry this over into a dedicated issue.


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue for adding the two usage guides to Spark documentation

gh issue create \
  --title "Add additional Spark usage guides to documentation" \
  --body "## Context

Two additional usage guides for Apache Spark with CrateDB should be added to the Spark integration documentation page.

## Usage Guides to Add

1. Machine Learning on the Edge with Apache Spark: https://github.com/crate/crate-ml/blob/master/docs/ml_on_the_edge/apache_spark/Readme.md (by @patrickFuerst)
2. PySpark Write DataFrame to CrateDB via JDBC: https://github.com/crate/pre-sales/blob/master/Demos/PySpark/pyspark_write_data_frame_to_CrateDB_jdbc.py (by @ckurze)

## Target File

\`docs/integrate/spark/index.md\`

## References

- Pull Request: https://github.com/crate/cratedb-guide/pull/281
- Comment: https://github.com/crate/cratedb-guide/pull/281#discussion_r2388862038
- Requested by: @amotl" \
  --assignee amotl

Length of output: 49


@amotl, I've created issue #352 to track adding those two Spark usage guides to the documentation. The issue includes links to both guides and references back to this PR and comment.

amotl changed the title from "Databricks and Spark: Basic connectivity tutorials" to "Databricks and Spark: Basic usage guides" on Sep 30, 2025
amotl marked this pull request as ready for review on September 30, 2025 at 16:49
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
docs/integrate/spark/usage.md (2)

8-12: Consider adding Java to the Prerequisites section.

Since the setup instructions (line 44) require Java (and Scala) to run Apache Spark, it would be clearer to mention Java as a prerequisite upfront rather than only in the setup section.

Apply this diff:

 Prerequisites:
 
 1. Running instance of [CrateDB](https://console.cratedb.cloud/)
 2. Python 3.x
+3. Java 11 or later

68-68: Align PostgreSQL JDBC driver versions across documentation.

Currently, spark/usage.md references 42.6.0, databricks/azure-databricks.md uses 42.2.23, and tableau/index.md mentions 42.7.1. Choose a single, up-to-date version (for example, the latest stable 42.7.x) or document why different contexts require different drivers.

docs/integrate/databricks/azure-databricks.md (2)

21-36: Add language identifier to Python code block.

The code block is missing a language identifier, which helps with syntax highlighting and accessibility.

Change the code fence from ``` to ```python.

-```
+```python
 crateUsername = "<username>"
 cratePassword = "<password>"
 postgresqlUrl = "jdbc:postgresql://<url-to-server>:5432/?sslmode=require";

46-62: Add language identifier to Scala code block.

The code block is missing a language identifier. Also note that the Scala example includes a useful fetchsize option (line 59) that could be beneficial to mention in the documentation text or in the Python example; a sketch of that follows below.

Change the code fence from ``` to ```scala.

-```
+```scala
 val crateUsername = "<username>"
 val cratePassword = "<password>"
 val postgresqlUrl = "jdbc:postgresql://<url-to-server>:5432/?sslmode=require";

Note: The static analysis tool flagged line 50 as a "bare URL," but this is a false positive—it's correctly used within a string assignment.
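If the fetchsize hint gets carried into the Python notebook example as suggested above, a hedged sketch could look like the following. It assumes the `spark` session that Databricks notebooks provide; the host, credentials, table name, and the fetch size value are illustrative placeholders, not values from the tutorial.

```python
# Sketch only: the Python notebook JDBC read with the fetchsize option
# carried over from the Scala example. All connection values are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<url-to-server>:5432/?sslmode=require")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "<table>")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("fetchsize", "1000")  # rows fetched per round trip; tune for the workload
    .load()
)

df.head()
```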

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e89b372 and eeb50f9.

📒 Files selected for processing (4)
  • docs/integrate/databricks/azure-databricks.md (1 hunks)
  • docs/integrate/databricks/index.md (1 hunks)
  • docs/integrate/spark/index.md (1 hunks)
  • docs/integrate/spark/usage.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/integrate/databricks/azure-databricks.md

50-50: Bare URL used

(MD034, no-bare-urls)

docs/integrate/spark/index.md

50-50: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (3)
docs/integrate/databricks/index.md (1)

1-53: LGTM! Well-structured Databricks integration index.

The documentation index is well-organized with clear sections (About, Related, Learn) and proper cross-references to Apache Spark and the Azure Databricks tutorial.

docs/integrate/spark/index.md (1)

1-65: LGTM! Well-structured Spark integration index.

The documentation index is well-organized with clear sections and proper cross-references to Databricks and usage guides. The structure mirrors the Databricks index page nicely for consistency.

Note: The static analysis tool flagged line 50 as a "bare URL," but this is a false positive—the URL is correctly used within a grid-item-card's :link: property, not as bare text.

docs/integrate/spark/usage.md (1)

79-120: LGTM! Clear and correct PySpark examples.

The code examples demonstrate proper JDBC connectivity with CrateDB, including:

  • Correct driver configuration and classpath setup
  • Proper DataFrame operations (filtering, aggregation)
  • Appropriate write mode (overwrite) for the use case
  • Good practice of stopping the Spark session

The explanatory text is clear and helpful.

Comment on lines +26 to +39
```sql
INSERT INTO sensor_data (id, value, machine)
SELECT
    id,
    floor(random()*100),
    CASE floor(random() * 3)
        WHEN 0 THEN 'machine 1'
        WHEN 1 THEN 'machine 2'
        WHEN 2 THEN 'machine 3'
        ELSE 'machine 4'
    END AS machine
FROM
    generate_series(1, 100) id;
```

⚠️ Potential issue | 🟡 Minor

Unreachable ELSE clause in CASE statement.

Line 30: floor(random() * 3) generates only values 0, 1, or 2, so the ELSE clause assigning 'machine 4' (line 34) will never execute.

If three machines are intended, remove the ELSE clause:

     CASE floor(random() * 3)
         WHEN 0 THEN 'machine 1'
         WHEN 1 THEN 'machine 2'
         WHEN 2 THEN 'machine 3'
-        ELSE 'machine 4'
    END AS machine

Or if four machines are intended, use floor(random() * 4):

-    CASE floor(random() * 3)
+    CASE floor(random() * 4)
         WHEN 0 THEN 'machine 1'
         WHEN 1 THEN 'machine 2'
         WHEN 2 THEN 'machine 3'
         ELSE 'machine 4'
    END AS machine
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

     INSERT INTO sensor_data (id, value, machine)
     SELECT
         id,
         floor(random()*100),
-        CASE floor(random() * 3)
+        CASE floor(random() * 4)
             WHEN 0 THEN 'machine 1'
             WHEN 1 THEN 'machine 2'
             WHEN 2 THEN 'machine 3'
             ELSE 'machine 4'
         END AS machine
     FROM
         generate_series(1, 100) id;
🤖 Prompt for AI Agents
In docs/integrate/spark/usage.md around lines 26 to 38 the CASE ELSE branch is
unreachable because floor(random() * 3) yields only 0, 1, or 2; either remove
the ELSE and keep three WHEN branches for 'machine 1'..'machine 3' if you intend
three machines, or change the random multiplier to 4 (floor(random() * 4)) and
add a WHEN 3 THEN 'machine 4' branch so all four machine labels can be produced.

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/integrate/spark/usage.md (1)

26-39: Unreachable ELSE clause in CASE statement.

The CASE statement has a logic error: floor(random() * 3) generates only values 0, 1, or 2, so the ELSE clause assigning 'machine 4' (line 35) will never execute.

Choose one of these fixes:

Option 1: Keep three machines (remove ELSE):

     CASE floor(random() * 3)
         WHEN 0 THEN 'machine 1'
         WHEN 1 THEN 'machine 2'
         WHEN 2 THEN 'machine 3'
-        ELSE 'machine 4'
    END AS machine

Option 2: Support four machines (change multiplier):

-    CASE floor(random() * 3)
+    CASE floor(random() * 4)
         WHEN 0 THEN 'machine 1'
         WHEN 1 THEN 'machine 2'
         WHEN 2 THEN 'machine 3'
         ELSE 'machine 4'
    END AS machine
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eeb50f9 and 2d2b530.

📒 Files selected for processing (4)
  • docs/integrate/databricks/azure-databricks.md (1 hunks)
  • docs/integrate/databricks/index.md (1 hunks)
  • docs/integrate/spark/index.md (1 hunks)
  • docs/integrate/spark/usage.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/integrate/databricks/azure-databricks.md
  • docs/integrate/databricks/index.md
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/integrate/spark/index.md

50-50: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build docs
🔇 Additional comments (14)
docs/integrate/spark/index.md (6)

1-3: LGTM!

The dual label approach (spark)= and (apache-spark)= provides flexible cross-referencing options throughout the documentation.


5-24: LGTM!

The About section provides a clear, accurate description of Apache Spark's capabilities, covering distributed computing, multi-language support, and various data processing use cases.


26-32: LGTM!

The Related section correctly links to Databricks documentation, which is appropriate since Databricks is built on Apache Spark.


37-55: LGTM!

The Learn section provides helpful navigation to both the usage guide and practical examples. The grid-item-cards use appropriate link types (ref for internal, url for external).

Note: The static analysis warning about a "bare URL" at line 50 is a false positive—the URL is properly structured within the :link: directive, not bare text.


58-62: LGTM!

The toctree configuration is appropriate, with :hidden: keeping the navigation clean while still making the Usage page discoverable.


65-65: LGTM!

The external link reference correctly points to the official Apache Spark website.

docs/integrate/spark/usage.md (8)

1-13: LGTM!

The introduction clearly explains the value proposition of combining Apache Spark with CrateDB, and the prerequisites are accurate and complete.


18-24: LGTM!

The table schema is appropriate for the tutorial, with IF NOT EXISTS ensuring idempotent execution.


41-68: LGTM!

The setup instructions are comprehensive and include verification steps. The JAVA_HOME configuration is correctly documented for macOS users.


81-90: LGTM!

The PySpark JDBC configuration is correct for connecting to CrateDB via the PostgreSQL protocol. The code properly includes the JDBC driver JAR and uses appropriate connection options.


100-103: LGTM!

The DataFrame operations demonstrate appropriate PySpark patterns for filtering and aggregation that align with the sensor_data schema.


110-116: Verify the write mode syntax.

Line 115 uses .save(mode="overwrite"), but the standard PySpark syntax is .mode("overwrite").save(). Please verify this works as written or adjust to the more common pattern.

# Standard pattern:
grouped_df.write.format("jdbc") \
    .option(...) \
    .mode("overwrite") \
    .save()

# Or:
grouped_df.write.format("jdbc") \
    .option(...) \
    .save(mode="overwrite")  # Both may work, but .mode() is more common
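Both spellings should be accepted, since PySpark's DataFrameWriter.save() also takes an optional mode keyword, but the chained form reads more idiomatically. A complete hedged sketch of the chained form, assuming `grouped_df` is the aggregated DataFrame from the guide and using placeholder connection values and a hypothetical target table:

```python
# Chained DataFrameWriter pattern: set the save mode explicitly, then save.
# URL, credentials, and target table name are placeholders, not values from the guide.
(
    grouped_df.write.format("jdbc")
    .option("url", "jdbc:postgresql://<url-to-server>:5432/?sslmode=require")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "machine_averages")
    .option("user", "<username>")
    .option("password", "<password>")
    .mode("overwrite")
    .save()
)
```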

105-108: LGTM!

The explanatory text provides clear guidance on writing results back to CrateDB and includes a helpful verification step with a SELECT query.

Also applies to: 118-128


134-143: LGTM!

The wrap-up section effectively summarizes the tutorial and provides relevant next steps with links to CrateDB Cloud and community resources.

Comment on lines +69 to +70
4. Finally, as CrateDB communicates with Spark via JDBC, download the [Postgres JDBC driver](https://jdbc.postgresql.org/download/) in your working directory.
In this usage guide, we use the `postgresql-42.6.0.jar` driver.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Verify JDBC driver version is current.

The guide references postgresql-42.6.0.jar from what appears to be 2023 content. Ensure this version is still appropriate or consider recommending users download the latest compatible version.


🌐 Web query:

What is the latest version of PostgreSQL JDBC driver compatible with Apache Spark and CrateDB?

💡 Result:

Short answer: Use the latest pgJDBC release (org.postgresql:postgresql v42.7.7 as of the pgjdbc releases page) — it is compatible with Spark (Spark simply requires the JDBC driver on the classpath) and meets CrateDB's requirement of pgjdbc 9.4.1209 or newer. [1][2][3]

Sources:

  • CrateDB JDBC compatibility: pgjdbc 9.4.1209 and above are compatible. [1]
  • pgjdbc releases (latest release listing shows v42.7.7). [2]
  • Apache Spark JDBC docs (Spark uses the PostgreSQL JDBC jar on the classpath; no special version restriction). [3]

Recommend using latest PostgreSQL JDBC driver version
docs/integrate/spark/usage.md (lines 69–70): replace the hard-coded postgresql-42.6.0.jar with the most recent release (e.g., postgresql-42.7.7.jar as of Sept 2025) or instruct users to download the latest version available.
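If the guides standardize on a newer driver, one way to avoid hard-coding a downloaded jar at all is to resolve it by Maven coordinates when building the session. A minimal sketch, assuming the 42.7.7 version suggested by the query above:

```python
from pyspark.sql import SparkSession

# Resolve the pinned pgJDBC driver via Maven coordinates instead of a
# hand-downloaded jar file. Bump the version when the docs are updated.
spark = (
    SparkSession.builder
    .appName("cratedb-jdbc")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.7")
    .getOrCreate()
)
```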

amotl (Member Author):

Already tracking this per GH-351?

amotl merged commit 69003c8 into main on Sep 30, 2025
3 checks passed
amotl deleted the spark branch on September 30, 2025 at 18:07