
Conversation

@amotl (Member) commented Aug 22, 2025

@coderabbitai bot commented Aug 22, 2025

Warning

Rate limit exceeded

@amotl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 7 minutes and 35 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


📥 Commits

Reviewing files that changed from the base of the PR and between dff6b1f and 125edb5.

📒 Files selected for processing (4)
  • docs/integrate/kafka/attic.md (1 hunks)
  • docs/integrate/kafka/docker-python.md (1 hunks)
  • docs/integrate/kafka/index.md (3 hunks)
  • docs/integrate/kafka/kafka-connect.md (2 hunks)

Walkthrough

Adds two new Kafka documentation pages (an attic list of managed offerings and a Docker+Python ingestion tutorial), rewrites and expands the Kafka index, makes small edits to the Kafka Connect page, and adds a docs/conf.py linkcheck_ignore entry. All changes are documentation-only; no runtime, API, or exported-public-entity changes.

Changes

Cohort / File(s) — Summary of changes

• Kafka index rewrite — docs/integrate/kafka/index.md
  Rewritten and expanded Kafka landing page: new overview and CI badge, Benefits of CrateDB + Apache Kafka, Common Ingestion Options (Kafka Connect, custom consumers, stream processors), Typical use cases, Deployment options, Key design considerations, updated Learn grid (adds Docker+Python tutorial), and removal of several vendor reference links.
• Docker + Python tutorial — docs/integrate/kafka/docker-python.md
  New end-to-end guide with Docker Compose (Kafka + CrateDB), topic creation and sample producers, quick_consumer.py using confluent_kafka to batch (≤500) messages and POST them to CrateDB's HTTP SQL endpoint as INSERT ... ON CONFLICT upserts, run/verification steps, and message-format notes.
• Kafka Connect tweaks — docs/integrate/kafka/kafka-connect.md
  Title changed to “Using Confluent Kafka Connect”; docker-compose commands updated to docker compose; no semantic content changes.
• Managed Kafka attic doc — docs/integrate/kafka/attic.md
  New orphan doc listing managed Kafka offerings (Aiven, MSK, Azure Kafka, Azure Event Hubs for Kafka, Confluent Cloud, DoubleCloud) with reference links and an overview link.
• Docs config change — docs/conf.py
  Added linkcheck_ignore entry for GitHub Actions runs at https://github.com/crate/cratedb-toolkit/actions/runs/.* (alongside the existing crate-workbench pattern).
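
The conf.py change amounts to one more pattern in the linkcheck ignore list. A minimal sketch of the resulting shape, with surrounding entries abbreviated:

```python
# docs/conf.py (sketch; other entries omitted)
linkcheck_ignore = [
    r"https://github.com/crate-workbench/cratedb-toolkit/actions/runs/.*",
    r"https://github.com/crate/cratedb-toolkit/actions/runs/.*",  # newly added
]
```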

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Kafka as Kafka cluster
    participant Consumer as Python consumer
    participant Crate as CrateDB HTTP API

    User->>Kafka: produce JSON messages to topic "sensors"
    loop Polling
        Consumer->>Kafka: poll()
        Kafka-->>Consumer: messages (JSON)
        Consumer->>Consumer: parse & batch (<=500)
    end
    alt Batch ready or shutdown
        Consumer->>Crate: POST bulk INSERT ... ON CONFLICT ...
        Crate-->>Consumer: 200 OK / errors
    else Idle
        Consumer->>Consumer: wait/backoff
    end
    note over Consumer,Crate: Batched idempotent upserts via HTTP bulk endpoint
```
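
A minimal sketch of the consumer loop the diagram describes, assuming a local CrateDB at localhost:4200 (bulk inserts go through the /_sql endpoint's bulk_args mechanism), a "sensors" topic, and that (device_id, ts) is the table's primary key; the column names are illustrative and the PR's actual quick_consumer.py may differ:

```python
import json
import requests
from confluent_kafka import Consumer

SQL_ENDPOINT = "http://localhost:4200/_sql"  # assumed local CrateDB
# Upsert statement; assumes (device_id, ts) is the primary key.
STMT = ("INSERT INTO sensors (device_id, ts, value) VALUES (?, ?, ?) "
        "ON CONFLICT (device_id, ts) DO UPDATE SET value = excluded.value")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed host-reachable listener
    "group.id": "demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sensors"])

def insert_batch(rows):
    # CrateDB's HTTP endpoint runs one statement over many parameter rows.
    body = {"stmt": STMT, "bulk_args": rows}
    r = requests.post(SQL_ENDPOINT, json=body, timeout=10)
    r.raise_for_status()

batch = []
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        doc = json.loads(msg.value())
        batch.append([doc["device_id"], doc["ts"], doc["value"]])
        if len(batch) >= 500:  # flush at the batch-size cap
            insert_batch(batch)
            batch.clear()
finally:
    if batch:
        insert_batch(batch)
    consumer.close()
```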

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

standing-1200, new content, refactoring, guidance

Suggested reviewers

  • bmunkholm
  • kneth
  • surister

Poem

I hop through queues with tiny feet,
I batch my carrots nice and neat.
I sip the streams, then gently write—
Upsert the garden through the night.
Happy data, snug and bright. 🥕


@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/integrate/kafka/kafka-connect.md (2)

60-67: Fix invalid JSON payload (trailing comma).

The example JSON has a trailing comma after the last field, which makes it invalid and will break copy/paste usage.

 {
    "sensor_id":101,
    "timestamp":"2022-06-12T19:00:00Z",
    "temperature":22.5,
-   "humidity":60,
+   "humidity":60
 }

366-369: Fix schema name used in SELECT examples (should be doc, not crate).

crate is a user, not a schema. The default schema for user tables is doc. The current examples will fail as written.

-SELECT *
-FROM "crate"."sensortable"
+SELECT *
+FROM "doc"."sensortable"
 LIMIT 100;
-cr> SELECT * FROM crate.sensortable LIMIT 5;
+cr> SELECT * FROM doc.sensortable LIMIT 5;

Also applies to: 374-385

🧹 Nitpick comments (12)
docs/integrate/kafka/kafka-connect.md (4)

231-237: Use modern Docker Compose syntax consistently.

Prefer docker compose (v2) over docker-compose (v1) and keep it consistent throughout.

-docker-compose up -d
+docker compose up -d
-docker-compose ps
+docker compose ps

Also applies to: 243-245


200-209: Pin CrateDB image to a specific version for reproducibility.

Using latest risks drift and inconsistent results across runs. Pin to a known-good CrateDB release.

-  cratedb:
-    image: crate:latest
+  cratedb:
+    image: crate:5.5.0

If a different version is intended, replace with the exact tag you’ve validated.


261-281: Be explicit with schema and avoid ambiguous settings in the sink config.

  • Explicitly qualify the table with the doc schema to match CrateDB defaults.
  • Consider removing schema.pattern for sink use; it’s more typical on source connectors and can be confusing here.
   "config": {
     "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
     "tasks.max": "1",
     "topics": "sensor-data-topic",
     "connection.url": "jdbc:postgresql://cratedb:5432/crate?user=crate&sslmode=disable",
     "dialect.name": "PostgreSqlDatabaseDialect",
     "auto.create": "true",
     "auto.evolve": "true",
     "insert.mode": "insert",
     "pk.mode": "none",
     "value.converter": "io.confluent.connect.avro.AvroConverter",
     "value.converter.schemas.enable": "true",
     "value.converter.schema.registry.url": "http://schema-registry:8081",
     "key.converter": "org.apache.kafka.connect.storage.StringConverter",
     "key.converter.schemas.enable": "false",
-    "table.name.format": "sensortable",
-    "schema.pattern": "doc"
+    "table.name.format": "doc.sensortable"
   }

108-113: Pin the Kafka Connect JDBC plugin version.

latest can introduce breaking changes. Pin to a validated version to keep the tutorial stable.

-  bash -c "confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:latest"
+  bash -c "confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:10.7.5"

Replace with the actual version you’ve tested.

docs/integrate/kafka/attic.md (1)

11-21: Double-check provider naming/links and clarify non-exhaustive list.

Names and URLs for managed services can change. A brief note like “selection, not exhaustive” is already present; consider adding the month/year to set expectations.

Example addition after the dropdown intro: “List verified August 2025.”

docs/integrate/kafka/docker-python.md (5)

32-38: Pin CrateDB image version for deterministic results.

Avoid latest in tutorials.

-  cratedb:
-    image: crate:latest
+  cratedb:
+    image: crate:5.5.0

50-56: Tighten wording and avoid bare URL.

  • Say “Admin UI” instead of “cloud UI” here.
  • Wrap bare URL in angle brackets to satisfy MD034.
-* CrateDB Admin UI: http://localhost:4200
+* CrateDB Admin UI: <http://localhost:4200>
@@
-The easiest way to do this is through the CrateDB cloud UI at http://localhost:4200 and execute this using the console:
+The easiest way is through the CrateDB Admin UI at <http://localhost:4200> and execute this using the console:

98-141: Consumer: commit strategy and robustness.

Current flow relies on default auto-commit timing and may reprocess on restart. Since writes are idempotent via ON CONFLICT, you can safely commit after a successful batch to reduce lag and speed recovery. Also log/handle HTTP exceptions.

 from confluent_kafka import Consumer
@@
 c = Consumer({
     "bootstrap.servers": "localhost:9092",
     "group.id": "demo",
     "auto.offset.reset": "earliest",
+    "enable.auto.commit": False,
 })
@@
 def insert_batch(rows):
@@
-    r = requests.post(SQL_ENDPOINT, json=body, timeout=10)
-    r.raise_for_status()
+    r = requests.post(SQL_ENDPOINT, json=body, timeout=10)
+    r.raise_for_status()
+    return True
@@
-        if len(batch) >= 500:
-            insert_batch(batch)
-            batch.clear()
+        if len(batch) >= 500:
+            if insert_batch(batch):
+                c.commit(asynchronous=False)
+            batch.clear()
 finally:
-    if batch: insert_batch(batch)
+    if batch:
+        if insert_batch(batch):
+            c.commit(asynchronous=False)
     c.close()

1-2: Anchor/name consistency with Guide conventions.

Per prior restructuring, main anchors live in integration index.md and detailed tutorials often use a -learn suffix. Consider renaming the ref to kafka-python-learn (or kafka-docker-python-learn) to stay consistent, and update the index toctree accordingly.

We learned this convention from earlier feedback and can apply it here for navigational consistency.

Also applies to: 104-109


144-149: Minor: suggest virtualenv note before pip install.

A short note to use a virtual environment helps keep user environments clean.

Example addition above the commands: “Optionally create and activate a virtual environment (e.g., python -m venv .venv && source .venv/bin/activate) before installing packages.”

docs/integrate/kafka/index.md (2)

41-43: Soften the “99% compatible” claim and clarify transaction note.

Claims with precise percentages can mislead and go stale. Recommend “largely compatible” and explicitly mention notable differences (e.g., transactions unsupported) without quantifying.

-:::{important}
-The Apache Kafka PostgreSQL connector is 99% compatible with CrateDB. CrateDB does not support transactions, but this only becomes notable when either ingestion to CrateDB is failing or there is some other connection issue.&#x20;
-:::
+:::{important}
+The Apache Kafka PostgreSQL connector is largely compatible with CrateDB. Note that CrateDB does not support transactions; this generally matters only during failures or connection issues that require retries or rollbacks.
+:::

84-85: Remove unnecessary escape in “Crate.io”.

The backslash before the dot isn’t needed.

-* **Managed CrateDB** – Crate\.io offers CrateDB Cloud, which can pair with either self-managed Kafka or managed Kafka services. This option reduces database operations to a minimum.
+* **Managed CrateDB** – Crate.io offers CrateDB Cloud, which can pair with either self-managed Kafka or managed Kafka services. This option reduces database operations to a minimum.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f05a2b2 and c4c1313.

📒 Files selected for processing (4)
  • docs/integrate/kafka/attic.md (1 hunks)
  • docs/integrate/kafka/docker-python.md (1 hunks)
  • docs/integrate/kafka/index.md (3 hunks)
  • docs/integrate/kafka/kafka-connect.md (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: amotl
PR: crate/cratedb-guide#238
File: docs/integrate/azure-functions/learn.rst:1-1
Timestamp: 2025-08-09T16:19:43.966Z
Learning: In the CrateDB Guide documentation, main integration anchors (e.g., `azure-functions`) are intentionally placed in the `index.md` files of their respective integration folders, while detailed tutorials use the `-learn` suffix (e.g., `azure-functions-learn`) in their `learn.rst` or `learn.md` files. This is a deliberate architectural pattern for the documentation restructuring.
📚 Learning: 2025-08-09T16:19:43.966Z
Learnt from: amotl
PR: crate/cratedb-guide#238
File: docs/integrate/azure-functions/learn.rst:1-1
Timestamp: 2025-08-09T16:19:43.966Z
Learning: In the CrateDB Guide documentation, main integration anchors (e.g., `azure-functions`) are intentionally placed in the `index.md` files of their respective integration folders, while detailed tutorials use the `-learn` suffix (e.g., `azure-functions-learn`) in their `learn.rst` or `learn.md` files. This is a deliberate architectural pattern for the documentation restructuring.

Applied to files:

  • docs/integrate/kafka/index.md
🪛 LanguageTool
docs/integrate/kafka/attic.md

[grammar] ~4-~4: There might be a mistake here.
Context: --- orphan: true --- :::{rubric} About ::: [Apache Kafka] is an open-source di...

(QB_NEW_EN)


[grammar] ~7-~7: There might be a mistake here.
Context: ...ributed event streaming platform used by thousands of companies for high-performa...

(QB_NEW_EN)


[grammar] ~8-~8: There might be a mistake here.
Context: ...nce data pipelines, streaming analytics, data integration, and mission-critical a...

(QB_NEW_EN)


[grammar] ~11-~11: There might be a mistake here.
Context: ...tions. :::{dropdown} Managed Kafka Several companies provide managed Kafka ...

(QB_NEW_EN)


[grammar] ~12-~12: There might be a mistake here.
Context: ...he [overview of managed Kafka offerings] for a more complete list). - [Aiven for...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...mplete list). - [Aiven for Apache Kafka] - [Amazon Managed Streaming for Apache Kafk...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...Managed Streaming for Apache Kafka (MSK)] - [Apache Kafka on Azure] - [Azure Event Hu...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...he Kafka (MSK)] - [Apache Kafka on Azure] - [Azure Event Hubs for Apache Kafka] - [Co...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ...re] - [Azure Event Hubs for Apache Kafka] - [Confluent Cloud] - [DoubleCloud Managed ...

(QB_NEW_EN)


[grammar] ~19-~19: There might be a mistake here.
Context: ...ubs for Apache Kafka] - [Confluent Cloud] - [DoubleCloud Managed Service for Apache K...

(QB_NEW_EN)

docs/integrate/kafka/docker-python.md

[grammar] ~1-~1: There might be a mistake here.
Context: (kafka-docker-python)= # Using Kafka with Python This walkthroug...

(QB_NEW_EN)


[grammar] ~4-~4: There might be a mistake here.
Context: ...w to load data from a Kafka topic into a CrateDB table, using a Python consumer a...

(QB_NEW_EN)


[grammar] ~50-~50: There might be a mistake here.
Context: ... CrateDB Admin UI: http://localhost:4200 * Kafka broker (inside-compose hostname): ...

(QB_NEW_EN)

docs/integrate/kafka/index.md

[grammar] ~20-~20: There might be a mistake here.
Context: ...fix ``` :::{include} /_include/links.md ::: :::{div} sd-text-muted Apache Kafka...

(QB_NEW_EN)


[grammar] ~23-~23: There might be a mistake here.
Context: ...ude/links.md ::: :::{div} sd-text-muted Apache Kafka is a widely used open-sourc...

(QB_NEW_EN)


[grammar] ~24-~24: There might be a mistake here.
Context: ...uted event-store and streaming platform. ::: ## Overview [Apache Kafka] is a d...

(QB_NEW_EN)


[grammar] ~35-~35: There might be a mistake here.
Context: ...s, consumers, and CrateDB independently. * Near-real-time analytics – New events ...

(QB_NEW_EN)


[grammar] ~36-~36: There might be a mistake here.
Context: ...n, exposed via SQL to standard BI tools. * Operational resilience – Use Kafka as ...

(QB_NEW_EN)


[grammar] ~73-~73: There might be a mistake here.
Context: ...ts directly from queries over live data. * **ETL / ELT landing zone for downstream an...

(QB_NEW_EN)


[grammar] ~82-~82: There might be a mistake here.
Context: ...anage scaling, upgrades, and monitoring. * Managed Kafka services – Use a provide...

(QB_NEW_EN)


[grammar] ~84-~84: There might be a mistake here.
Context: ...providers as well. * Managed CrateDB – Crate.io offers CrateDB Cloud, which can pair...

(QB_NEW_EN)


[grammar] ~91-~91: There might be a mistake here.
Context: ...vice_id) to keep ordering where needed. * Table modelling in CrateDB – Choose pr...

(QB_NEW_EN)


[grammar] ~92-~92: There might be a mistake here.
Context: ...formance - define shard count per table. * Upserts & deduplication – Include a st...

(QB_NEW_EN)


[grammar] ~93-~93: There might be a mistake here.
Context: ...to make writes idempotent when possible. * Batching & back-pressure – Tune sink b...

(QB_NEW_EN)


[grammar] ~94-~94: There might be a mistake here.
Context: ...gest capacity while keeping latency low. * Schema & types – Keep payloads consist...

(QB_NEW_EN)


[grammar] ~95-~95: There might be a mistake here.
Context: ...imestamps/time zones, numerics, arrays). * Retention & replay – Kafka retention d...

(QB_NEW_EN)


[grammar] ~96-~96: There might be a mistake here.
Context: ...plan storage and compaction accordingly. * Observability – Monitor producer lag, ...

(QB_NEW_EN)


[grammar] ~101-~101: There might be a mistake here.
Context: ...ry latency. ## Learn more ::::{grid} 2 :gutter: 2 :::{grid-item-card} Tutorial...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/integrate/kafka/docker-python.md

119-119: Bare URL used

(MD034, no-bare-urls)


125-125: Bare URL used

(MD034, no-bare-urls)


144-144: Link and image reference definitions should be needed
Unused link or image reference definition: "cratedb and apache kafka"

(MD053, link-image-reference-definitions)

🔇 Additional comments (4)
docs/integrate/kafka/kafka-connect.md (1)

2-2: Title update aligns with content—good change.

Renaming to “Using Confluent Kafka Connect” better reflects the page scope and matches the new index card. Please ensure any inbound links or references in the repo that still use the old title are redirected or updated.

docs/integrate/kafka/attic.md (1)

1-3: Front-matter “orphan: true” is appropriate here.

This keeps the page out of nav while remaining linkable. LGTM.

docs/integrate/kafka/index.md (2)

14-15: Nice addition: CI badge for Ingestr + Kafka.

This adds useful social proof and a quick status check. LGTM.


101-109: Good cross-link: Docker+Python tutorial is clearly surfaced.

Card copy mirrors the tutorial opening, and the :link-type: ref matches the page anchor. Once you finalize the tutorial ref name (see suggestion in that file), remember to update this link if it changes.

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/integrate/kafka/docker-python.md (1)

19-21: Fix Kafka advertised listeners to support host‐based Python clients

The current Compose setup advertises PLAINTEXT://kafka:9092, which works inside Docker but breaks a host‐based consumer connecting to localhost:9092 (it’ll still receive “kafka:9092” in metadata and be unable to resolve that host). We need dual listeners (one internal, one host-accessible) or align advertised listeners with the host endpoint.

Affected locations:

  • docs/integrate/kafka/docker-python.md, lines 19–21: Kafka environment variables
  • docs/integrate/kafka/docker-python.md, lines 27–28: port mappings
  • The Python consumer snippet (bootstrap.servers) around line 104 in the same file

Proposed minimal diff:

 services:
   kafka:
     environment:
-      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
-      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092
-      - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
+      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,PLAINTEXT_HOST://:9094,CONTROLLER://:9093
+      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092,PLAINTEXT_HOST://localhost:9094
+      - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT,CONTROLLER:PLAINTEXT

 services:
   kafka:
     ports:
-      - "9092:9092"
+      - "9092:9092"
+      - "9094:9094"   # expose the host listener for external clients

Also, ensure the Python consumer’s bootstrap setting matches your choice:

  • For dual listeners: keep "bootstrap.servers": "localhost:9092" (host listener on 9094 or remap to 9092)
  • Or run the consumer inside the demo network and use "bootstrap.servers": "kafka:9092" (single listener)

Update the code sample at line 104 accordingly.

🧹 Nitpick comments (14)
docs/integrate/kafka/attic.md (3)

11-21: Clarify that Event Hubs is Kafka API–compatible, not Kafka itself.

Readers often assume these are interchangeable. “Azure Event Hubs for Apache Kafka” exposes a Kafka-compatible endpoint but is not an Apache Kafka cluster; feature parity and operational semantics differ (e.g., partitions, offsets, retention, tooling). Add a one‑line qualifier in this list or a short note in the dropdown to prevent confusion.

Example minimal tweak:

- - [Azure Event Hubs for Apache Kafka]
+ - [Azure Event Hubs for Apache Kafka] (Kafka protocol–compatible service, not Apache Kafka)

12-14: Add staleness disclaimer for third‑party managed-offerings link.

The “overview of managed Kafka offerings” blog post may age quickly. Add a brief “not exhaustive / may be outdated” note to set expectations.

- Several companies provide managed Kafka services (see the [overview of managed Kafka offerings]
- for a more complete list).
+ Several companies provide managed Kafka services (see the [overview of managed Kafka offerings]
+ for examples; note that offerings and features change frequently).

15-20: Name accuracy: “Apache Kafka on Azure”.

The link points to a Canonical marketplace listing. Consider renaming the bullet to reflect that it’s a marketplace/offered image rather than an Azure‑native managed Kafka.

- - [Apache Kafka on Azure]
+ - [Apache Kafka on Azure (Marketplace image)]
docs/integrate/kafka/docker-python.md (5)

14-14: Pin Docker image versions for reproducibility.

Using :latest can introduce breaking changes over time. Pin known‑good tags (Kafka/CrateDB versions used in your CI or tested locally).

-    image: bitnami/kafka:latest
+    image: bitnami/kafka:3.7.0
@@
-    image: crate:latest
+    image: crate:5.7.0

If you prefer floating minors, document the expected minimum versions instead.

Also applies to: 32-32


55-55: Terminology: “Admin UI” instead of “cloud UI”.

http://localhost:4200 is the local Admin UI. Replace “cloud UI” to avoid confusion with CrateDB Cloud.

-The easiest way to do this is through the CrateDB cloud UI at `http://localhost:4200` and execute this using the console:
+The easiest way to do this is through the CrateDB Admin UI at `http://localhost:4200` and execute this using the console:

121-140: Optional: reduce frequent small flushes on idle polls.

Flushing on every poll() timeout can send undersized batches during lulls. Consider a time‑based flush (e.g., every N seconds) or a larger poll() timeout to improve throughput and reduce HTTP overhead.
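
A possible shape for that time-based flush, reusing the tutorial's consumer `c` and `insert_batch()` helper from the diff above; the five-second interval and the `parse()` helper are illustrative:

```python
import time

FLUSH_INTERVAL = 5.0  # seconds; illustrative choice
last_flush = time.monotonic()
batch = []

while True:
    msg = c.poll(1.0)
    if msg is not None and not msg.error():
        batch.append(parse(msg))  # parse() stands in for the tutorial's row mapping
    elapsed = time.monotonic() - last_flush
    # Flush on either the size cap or the elapsed-time cap, never on empty batches.
    if batch and (len(batch) >= 500 or elapsed >= FLUSH_INTERVAL):
        insert_batch(batch)
        batch.clear()
        last_flush = time.monotonic()
```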


146-148: Note on installing confluent-kafka.

On some platforms confluent-kafka needs librdkafka or a matching manylinux wheel. Add a short note with troubleshooting or suggest a containerized run of the consumer for portability.

Happy to add a minimal Dockerfile or a pip constraints snippet that’s known to work cross‑platform.


115-117: Add minimum CrateDB version for ON CONFLICT upsert

CrateDB’s INSERT … ON CONFLICT … DO UPDATE SET … EXCLUDED.* syntax was introduced in CrateDB 3.0.0 (released May 16, 2018) (cratedb.com). To ensure this example works as shown, please add a version note to the tutorial prerequisites, for example:

• In docs/integrate/kafka/docker-python.md (around the prerequisites section)

- Prerequisites:
-  - CrateDB cluster
+ Prerequisites:
+  - CrateDB cluster (version 3.0.0 or later)
docs/integrate/kafka/index.md (6)

34-38: Avoid “exactly-once semantics” phrasing here.

Idempotent upserts mitigate duplicates but don’t by themselves provide end‑to‑end exactly‑once across Kafka → CrateDB (especially since CrateDB doesn’t support transactions). Rephrase to set accurate expectations.

-* **Operational resilience** – Use Kafka as a durable buffer between CrateDB and data producers. Idempotent upserts (exactly-once semantics) reduce data-loss and duplication risks.
+* **Operational resilience** – Use Kafka as a durable buffer between producers and CrateDB. Idempotent upserts reduce duplication risks and improve recovery from retries.

41-43: Qualify PostgreSQL connector compatibility and link constraints.

“Largely compatible” is true for many cases, but specifics (no transactions, type mappings, DDL features) matter. Add a short bullet list of caveats or link to a compatibility page. This reduces surprises when retries/rollbacks occur.

I can draft a concise “Compatibility notes with PostgreSQL connectors” section and cross‑link it here.


23-25: Remove duplicated tagline to reduce repetition.

This muted paragraph repeats the page description above. Consider removing for brevity.

-:::{div} sd-text-muted
-Apache Kafka is a widely used open-source distributed event-store and streaming platform.
-:::
+<!-- Removed duplicated tagline -->

82-86: Clarify managed services wording.

Slightly tune the managed‑services bullet to note that some offerings are Kafka protocol–compatible (e.g., Event Hubs) rather than Kafka itself; link to the attic page if you want a longer list.

-* **Managed Kafka services** – Use a provider such as Confluent Cloud or AWS MSK to offload the operational heavy lifting of Kafka. You can still connect these managed clusters directly to a CrateDB deployment that you operate. CrateDB is also available on the major cloud providers as well.
+* **Managed Kafka services** – Use a provider such as Confluent Cloud or AWS MSK to offload Kafka operations. Some services (e.g., Azure Event Hubs) provide Kafka‑compatible endpoints rather than Kafka itself. Any of these can connect to a CrateDB deployment you operate or to CrateDB Cloud.

132-136: Optional: consider adding the managed offerings attic to the toctree (hidden).

If you expect to reference it elsewhere, adding it here (even hidden) makes discovery in search easier.

 ```{toctree}
 :hidden:
 docker-python
 kafka-connect
+attic

```

6-8: Anchor and naming alignment (minor).

Per the doc restructuring pattern used elsewhere, tutorials sometimes follow a “-learn” convention. No action required, but consider whether anchors like (kafka-docker-python) should follow that pattern for consistency.

Would you like me to propose a consistent anchor/name scheme across integration tutorials?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c4c1313917079286fe724edf3358ee1fb72f59fa and 9956eb608746d17ba8f6b21b1922223636132526.

📒 Files selected for processing (4)
  • docs/integrate/kafka/attic.md (1 hunks)
  • docs/integrate/kafka/docker-python.md (1 hunks)
  • docs/integrate/kafka/index.md (3 hunks)
  • docs/integrate/kafka/kafka-connect.md (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/integrate/kafka/kafka-connect.md

🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-09T16:19:43.966Z
Learnt from: amotl
PR: crate/cratedb-guide#238
File: docs/integrate/azure-functions/learn.rst:1-1
Timestamp: 2025-08-09T16:19:43.966Z
Learning: In the CrateDB Guide documentation, main integration anchors (e.g., `azure-functions`) are intentionally placed in the `index.md` files of their respective integration folders, while detailed tutorials use the `-learn` suffix (e.g., `azure-functions-learn`) in their `learn.rst` or `learn.md` files. This is a deliberate architectural pattern for the documentation restructuring.

Applied to files:

  • docs/integrate/kafka/index.md


🔇 Additional comments (2)
docs/integrate/kafka/docker-python.md (1)

59-59: Clarify TIMESTAMPTZ alias support (CrateDB ≥ 4.0)

CrateDB recognizes `TIMESTAMPTZ` as an alias for `TIMESTAMP WITH TIME ZONE` since version 4.0 (though it was deprecated in 4.x and may change in future releases; see https://cratedb.com/docs/crate/reference/en/4.8/general/ddl/data-types.html), and continues to do so in 5.x (https://cratedb.com/docs/crate/reference/en/5.9/general/ddl/data-types.html).

• File: docs/integrate/kafka/docker-python.md (line 59)  
• Recommend using the explicit type for portability, or, if you keep the alias, note the minimum version.

```diff
-    device_id TEXT, ts TIMESTAMPTZ,
+    device_id TEXT, ts TIMESTAMP WITH TIME ZONE,
```

Optional note if retaining TIMESTAMPTZ:

Note: Requires CrateDB ≥ 4.0 for TIMESTAMPTZ support.

docs/integrate/kafka/index.md (1)

101-110: Nice cross-link to the Python tutorial.

The Learn card aligns with the new tutorial and uses consistent wording with the tutorial’s intro. Looks good.

@amotl requested review from bmunkholm, kneth and surister August 22, 2025 18:11
@amotl marked this pull request as ready for review August 22, 2025 18:11
@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/conf.py (1)

39-41: Consider removing the old repo entry or consolidating both patterns

You now ignore both crate-workbench and crate URLs. If the intention was to replace the former, drop the old entry; otherwise, consider a single regex to cover both for maintainability.

Option A — remove the outdated entry:

-    r"https://github.com/crate-workbench/cratedb-toolkit/actions/runs/.*",
     r"https://github.com/crate/cratedb-toolkit/actions/runs/.*",

Option B — consolidate into one pattern:

-    r"https://github.com/crate-workbench/cratedb-toolkit/actions/runs/.*",
-    r"https://github.com/crate/cratedb-toolkit/actions/runs/.*",
+    r"https://github.com/crate(?:-workbench)?/cratedb-toolkit/actions/runs/.*",
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9956eb6 and 9956140.

📒 Files selected for processing (1)
  • docs/conf.py (1 hunks)

@amotl added the labels "new content" (New content being added.) and "guidance" (Matters of layout, shape, and structure.) Aug 24, 2025

### Create a demo table in CrateDB

The easiest way to do this is through the CrateDB cloud UI at `http://localhost:4200` and execute this using the console:
Member

Suggested change
The easiest way to do this is through the CrateDB cloud UI at `http://localhost:4200` and execute this using the console:
The easiest way to do this is through the CrateDB Admin UI at `http://localhost:4200` and execute this using the console:

Member Author

All your suggestions have been implemented with 15e80bf, thanks!

);
```

But this can also be done using curl:
Member

Suggested change
But this can also be done using curl:
But this can also be done using `curl`:
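
For reference, the same DDL can also be submitted from Python. A minimal sketch, assuming a local CrateDB on port 4200; the column list is illustrative, since the tutorial's actual CREATE TABLE statement is elided above:

```python
import requests

# Illustrative DDL; substitute the tutorial's actual CREATE TABLE statement.
stmt = (
    "CREATE TABLE IF NOT EXISTS sensors "
    "(device_id TEXT, ts TIMESTAMPTZ, value DOUBLE PRECISION)"
)
resp = requests.post("http://localhost:4200/_sql", json={"stmt": stmt}, timeout=10)
resp.raise_for_status()
```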


### Create a Kafka topic and send a couple of messages

This can be done in several ways, but we can use **docker-exec** in this way:
Member

Suggested change
This can be done in several ways, but we can use **docker-exec** in this way:
Creating a Kafka topic can be done in several ways, but we can use **docker-exec** in this way:
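
As an aside, the topic can also be created from Python instead of docker exec. A sketch using confluent_kafka's admin API, assuming the broker is reachable on localhost:9092 and reusing the tutorial's topic name:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed host listener
# create_topics() returns a dict mapping topic name -> future
futures = admin.create_topics([NewTopic("sensors", num_partitions=1, replication_factor=1)])
for topic, future in futures.items():
    future.result()  # raises if creation failed
    print(f"Created topic {topic}")
```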


For more advanced pipelines, you can process events while they’re still in Kafka before they ever reach CrateDB. Frameworks like Flink, Kafka Streams, or Spark let you enrich records, join multiple streams together, run aggregations, or apply windowing functions in real time.

The processed results are then written into CrateDB, where they’re immediately available for SQL queries and dashboards. This approach is powerful when raw events need to be cleaned, combined, or summarised before storage, though it adds more moving parts compared to a simple connector.
Member

Suggested change
The processed results are then written into CrateDB, where they’re immediately available for SQL queries and dashboards. This approach is powerful when raw events need to be cleaned, combined, or summarised before storage, though it adds more moving parts compared to a simple connector.
The processed results are then written into CrateDB, where they’re immediately available for SQL queries and dashboards. This approach is powerful when raw events need to be cleaned, combined, or summarised before storing them, though it adds moving parts compared to a simple connector.
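
To make the “process before storing” idea concrete, here is a toy sketch of pre-aggregating events in plain Python before insertion, standing in for what Flink or Kafka Streams would do; the window size and field names are illustrative:

```python
from collections import defaultdict

def aggregate_window(events, window_s=10):
    """Average readings per device over fixed time windows before storage."""
    buckets = defaultdict(list)
    for e in events:  # e.g. {"device_id": "d1", "epoch": 1724700000, "value": 21.5}
        window_start = int(e["epoch"] // window_s) * window_s
        buckets[(e["device_id"], window_start)].append(e["value"])
    return [
        [device, start, sum(vals) / len(vals)]
        for (device, start), vals in buckets.items()
    ]
```

Each returned row can then go through the same bulk-insert path as raw events.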


How you run Kafka and CrateDB depends a lot on your environment and preferences. The most common approaches are:

* **Containerised on-premise** – Run both Kafka and CrateDB on Docker or Kubernetes in your own data centre or private cloud. This gives you the most control, but also means you manage scaling, upgrades, and monitoring.
Member

Suggested change
* **Containerised on-premise** – Run both Kafka and CrateDB on Docker or Kubernetes in your own data centre or private cloud. This gives you the most control, but also means you manage scaling, upgrades, and monitoring.
* **Containerised on-premise** – Run both Kafka and CrateDB on Docker or Kubernetes in your own data centre or private cloud. This gives you the most control, but also means you manage scaling, upgrading, and monitoring.

@amotl merged commit a13bf28 into main Aug 26, 2025
3 checks passed
@amotl deleted the kafka branch August 26, 2025 22:10