
Conversation

@amotl (Member) commented Aug 22, 2025

@coderabbitai bot commented Aug 22, 2025

Warning

Rate limit exceeded

@amotl has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 7 minutes and 35 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


📥 Commits

Reviewing files that changed from the base of the PR and between dff6b1f and 125edb5.

📒 Files selected for processing (4)
  • docs/integrate/kafka/attic.md (1 hunks)
  • docs/integrate/kafka/docker-python.md (1 hunks)
  • docs/integrate/kafka/index.md (3 hunks)
  • docs/integrate/kafka/kafka-connect.md (2 hunks)

Walkthrough

Adds two new Kafka documentation pages (an attic list of managed offerings and a Docker+Python ingestion tutorial), rewrites and expands the Kafka index, makes small edits to the Kafka Connect page, and adds a docs/conf.py linkcheck_ignore entry. All changes are documentation-only; no runtime, API, or exported-public-entity changes.

Changes

Cohort / File(s) — Summary of changes

• Kafka index rewrite — docs/integrate/kafka/index.md
  Rewritten and expanded Kafka landing page: new overview and CI badge, Benefits of CrateDB + Apache Kafka, Common Ingestion Options (Kafka Connect, custom consumers, stream processors), Typical use cases, Deployment options, Key design considerations, updated Learn grid (adds Docker+Python tutorial), and removal of several vendor reference links.
• Docker + Python tutorial — docs/integrate/kafka/docker-python.md
  New end-to-end guide with Docker Compose (Kafka + CrateDB), topic creation and sample producers, quick_consumer.py using confluent_kafka to batch (≤500) messages and POST them to CrateDB's HTTP SQL endpoint as INSERT ... ON CONFLICT upserts, run/verification steps, and message-format notes.
• Kafka Connect tweaks — docs/integrate/kafka/kafka-connect.md
  Title changed to “Using Confluent Kafka Connect”; docker-compose commands updated to docker compose; no semantic content changes.
• Managed Kafka attic doc — docs/integrate/kafka/attic.md
  New orphan doc listing managed Kafka offerings (Aiven, MSK, Azure Kafka, Azure Event Hubs for Kafka, Confluent Cloud, DoubleCloud) with reference links and an overview link.
• Docs config change — docs/conf.py
  Added linkcheck_ignore entry for GitHub Actions runs at https://github.com/crate/cratedb-toolkit/actions/runs/.* (alongside the existing crate-workbench pattern).
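
The conf.py change amounts to one more pattern in the linkcheck ignore list. A minimal sketch of the resulting shape, with surrounding entries abbreviated:

```python
# docs/conf.py (sketch; other entries omitted)
linkcheck_ignore = [
    r"https://github.com/crate-workbench/cratedb-toolkit/actions/runs/.*",
    r"https://github.com/crate/cratedb-toolkit/actions/runs/.*",  # newly added
]
```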

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Kafka as Kafka cluster
    participant Consumer as Python consumer
    participant Crate as CrateDB HTTP API

    User->>Kafka: produce JSON messages to topic "sensors"
    loop Polling
        Consumer->>Kafka: poll()
        Kafka-->>Consumer: messages (JSON)
        Consumer->>Consumer: parse & batch (<=500)
    end
    alt Batch ready or shutdown
        Consumer->>Crate: POST bulk INSERT ... ON CONFLICT ...
        Crate-->>Consumer: 200 OK / errors
    else Idle
        Consumer->>Consumer: wait/backoff
    end
    note over Consumer,Crate: Batched idempotent upserts via HTTP bulk endpoint
```
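
A minimal sketch of the consumer loop the diagram describes, assuming a local CrateDB at localhost:4200 (bulk inserts go through the /_sql endpoint's bulk_args mechanism), a "sensors" topic, and that (device_id, ts) is the table's primary key; the column names are illustrative and the PR's actual quick_consumer.py may differ:

```python
import json
import requests
from confluent_kafka import Consumer

SQL_ENDPOINT = "http://localhost:4200/_sql"  # assumed local CrateDB
# Upsert statement; assumes (device_id, ts) is the primary key.
STMT = ("INSERT INTO sensors (device_id, ts, value) VALUES (?, ?, ?) "
        "ON CONFLICT (device_id, ts) DO UPDATE SET value = excluded.value")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed host-reachable listener
    "group.id": "demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sensors"])

def insert_batch(rows):
    # CrateDB's HTTP endpoint runs one statement over many parameter rows.
    body = {"stmt": STMT, "bulk_args": rows}
    r = requests.post(SQL_ENDPOINT, json=body, timeout=10)
    r.raise_for_status()

batch = []
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        doc = json.loads(msg.value())
        batch.append([doc["device_id"], doc["ts"], doc["value"]])
        if len(batch) >= 500:  # flush at the batch-size cap
            insert_batch(batch)
            batch.clear()
finally:
    if batch:
        insert_batch(batch)
    consumer.close()
```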

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

standing-1200, new content, refactoring, guidance

Suggested reviewers

  • bmunkholm
  • kneth
  • surister

Poem

I hop through queues with tiny feet,
I batch my carrots nice and neat.
I sip the streams, then gently write—
Upsert the garden through the night.
Happy data, snug and bright. 🥕


@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/integrate/kafka/kafka-connect.md (2)

60-67: Fix invalid JSON payload (trailing comma).

The example JSON has a trailing comma after the last field, which makes it invalid and will break copy/paste usage.

 {
    "sensor_id":101,
    "timestamp":"2022-06-12T19:00:00Z",
    "temperature":22.5,
-   "humidity":60,
+   "humidity":60
 }

366-369: Fix schema name used in SELECT examples (should be doc, not crate).

crate is a user, not a schema. The default schema for user tables is doc. The current examples will fail as written.

-SELECT *
-FROM "crate"."sensortable"
+SELECT *
+FROM "doc"."sensortable"
 LIMIT 100;
-cr> SELECT * FROM crate.sensortable LIMIT 5;
+cr> SELECT * FROM doc.sensortable LIMIT 5;

Also applies to: 374-385

🧹 Nitpick comments (12)
docs/integrate/kafka/kafka-connect.md (4)

231-237: Use modern Docker Compose syntax consistently.

Prefer docker compose (v2) over docker-compose (v1) and keep it consistent throughout.

-docker-compose up -d
+docker compose up -d
-docker-compose ps
+docker compose ps

Also applies to: 243-245


200-209: Pin CrateDB image to a specific version for reproducibility.

Using latest risks drift and inconsistent results across runs. Pin to a known-good CrateDB release.

-  cratedb:
-    image: crate:latest
+  cratedb:
+    image: crate:5.5.0

If a different version is intended, replace with the exact tag you’ve validated.


261-281: Be explicit with schema and avoid ambiguous settings in the sink config.

  • Explicitly qualify the table with the doc schema to match CrateDB defaults.
  • Consider removing schema.pattern for sink use; it’s more typical on source connectors and can be confusing here.
   "config": {
     "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
     "tasks.max": "1",
     "topics": "sensor-data-topic",
     "connection.url": "jdbc:postgresql://cratedb:5432/crate?user=crate&sslmode=disable",
     "dialect.name": "PostgreSqlDatabaseDialect",
     "auto.create": "true",
     "auto.evolve": "true",
     "insert.mode": "insert",
     "pk.mode": "none",
     "value.converter": "io.confluent.connect.avro.AvroConverter",
     "value.converter.schemas.enable": "true",
     "value.converter.schema.registry.url": "http://schema-registry:8081",
     "key.converter": "org.apache.kafka.connect.storage.StringConverter",
     "key.converter.schemas.enable": "false",
-    "table.name.format": "sensortable",
-    "schema.pattern": "doc"
+    "table.name.format": "doc.sensortable"
   }

108-113: Pin the Kafka Connect JDBC plugin version.

latest can introduce breaking changes. Pin to a validated version to keep the tutorial stable.

-  bash -c "confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:latest"
+  bash -c "confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:10.7.5"

Replace with the actual version you’ve tested.

docs/integrate/kafka/attic.md (1)

11-21: Double-check provider naming/links and clarify non-exhaustive list.

Names and URLs for managed services can change. A brief note like “selection, not exhaustive” is already present; consider adding the month/year to set expectations.

Example addition after the dropdown intro: “List verified August 2025.”

docs/integrate/kafka/docker-python.md (5)

32-38: Pin CrateDB image version for deterministic results.

Avoid latest in tutorials.

-  cratedb:
-    image: crate:latest
+  cratedb:
+    image: crate:5.5.0

50-56: Tighten wording and avoid bare URL.

  • Say “Admin UI” instead of “cloud UI” here.
  • Wrap bare URL in angle brackets to satisfy MD034.
-* CrateDB Admin UI: http://localhost:4200
+* CrateDB Admin UI: <http://localhost:4200>
@@
-The easiest way to do this is through the CrateDB cloud UI at http://localhost:4200 and execute this using the console:
+The easiest way is through the CrateDB Admin UI at <http://localhost:4200> and execute this using the console:

98-141: Consumer: commit strategy and robustness.

Current flow relies on default auto-commit timing and may reprocess on restart. Since writes are idempotent via ON CONFLICT, you can safely commit after a successful batch to reduce lag and speed recovery. Also log/handle HTTP exceptions.

 from confluent_kafka import Consumer
@@
 c = Consumer({
     "bootstrap.servers": "localhost:9092",
     "group.id": "demo",
     "auto.offset.reset": "earliest",
+    "enable.auto.commit": False,
 })
@@
 def insert_batch(rows):
@@
-    r = requests.post(SQL_ENDPOINT, json=body, timeout=10)
-    r.raise_for_status()
+    r = requests.post(SQL_ENDPOINT, json=body, timeout=10)
+    r.raise_for_status()
+    return True
@@
-        if len(batch) >= 500:
-            insert_batch(batch)
-            batch.clear()
+        if len(batch) >= 500:
+            if insert_batch(batch):
+                c.commit(asynchronous=False)
+            batch.clear()
 finally:
-    if batch: insert_batch(batch)
+    if batch:
+        if insert_batch(batch):
+            c.commit(asynchronous=False)
     c.close()

1-2: Anchor/name consistency with Guide conventions.

Per prior restructuring, main anchors live in integration index.md and detailed tutorials often use a -learn suffix. Consider renaming the ref to kafka-python-learn (or kafka-docker-python-learn) to stay consistent, and update the index toctree accordingly.

We learned this convention from earlier feedback and can apply it here for navigational consistency.

Also applies to: 104-109


144-149: Minor: suggest virtualenv note before pip install.

A short note to use a virtual environment helps keep user environments clean.

Example addition above the commands: “Optionally create and activate a virtual environment (e.g., python -m venv .venv && source .venv/bin/activate) before installing packages.”

docs/integrate/kafka/index.md (2)

41-43: Soften the “99% compatible” claim and clarify transaction note.

Claims with precise percentages can mislead and go stale. Recommend “largely compatible” and explicitly mention notable differences (e.g., transactions unsupported) without quantifying.

-:::{important}
-The Apache Kafka PostgreSQL connector is 99% compatible with CrateDB. CrateDB does not support transactions, but this only becomes notable when either ingestion to CrateDB is failing or there is some other connection issue.&#x20;
-:::
+:::{important}
+The Apache Kafka PostgreSQL connector is largely compatible with CrateDB. Note that CrateDB does not support transactions; this generally matters only during failures or connection issues that require retries or rollbacks.
+:::

84-85: Remove unnecessary escape in “Crate.io”.

The backslash before the dot isn’t needed.

-* **Managed CrateDB** – Crate\.io offers CrateDB Cloud, which can pair with either self-managed Kafka or managed Kafka services. This option reduces database operations to a minimum.
+* **Managed CrateDB** – Crate.io offers CrateDB Cloud, which can pair with either self-managed Kafka or managed Kafka services. This option reduces database operations to a minimum.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f05a2b2 and c4c1313.

📒 Files selected for processing (4)
  • docs/integrate/kafka/attic.md (1 hunks)
  • docs/integrate/kafka/docker-python.md (1 hunks)
  • docs/integrate/kafka/index.md (3 hunks)
  • docs/integrate/kafka/kafka-connect.md (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: amotl
PR: crate/cratedb-guide#238
File: docs/integrate/azure-functions/learn.rst:1-1
Timestamp: 2025-08-09T16:19:43.966Z
Learning: In the CrateDB Guide documentation, main integration anchors (e.g., `azure-functions`) are intentionally placed in the `index.md` files of their respective integration folders, while detailed tutorials use the `-learn` suffix (e.g., `azure-functions-learn`) in their `learn.rst` or `learn.md` files. This is a deliberate architectural pattern for the documentation restructuring.
📚 Learning: 2025-08-09T16:19:43.966Z
Learnt from: amotl
PR: crate/cratedb-guide#238
File: docs/integrate/azure-functions/learn.rst:1-1
Timestamp: 2025-08-09T16:19:43.966Z
Learning: In the CrateDB Guide documentation, main integration anchors (e.g., `azure-functions`) are intentionally placed in the `index.md` files of their respective integration folders, while detailed tutorials use the `-learn` suffix (e.g., `azure-functions-learn`) in their `learn.rst` or `learn.md` files. This is a deliberate architectural pattern for the documentation restructuring.

Applied to files:

  • docs/integrate/kafka/index.md
🪛 LanguageTool
docs/integrate/kafka/attic.md

[grammar] ~4-~4: There might be a mistake here.
Context: --- orphan: true --- :::{rubric} About ::: [Apache Kafka] is an open-source di...

(QB_NEW_EN)


[grammar] ~7-~7: There might be a mistake here.
Context: ...ributed event streaming platform used by thousands of companies for high-performa...

(QB_NEW_EN)


[grammar] ~8-~8: There might be a mistake here.
Context: ...nce data pipelines, streaming analytics, data integration, and mission-critical a...

(QB_NEW_EN)


[grammar] ~11-~11: There might be a mistake here.
Context: ...tions. :::{dropdown} Managed Kafka Several companies provide managed Kafka ...

(QB_NEW_EN)


[grammar] ~12-~12: There might be a mistake here.
Context: ...he [overview of managed Kafka offerings] for a more complete list). - [Aiven for...

(QB_NEW_EN)


[grammar] ~15-~15: There might be a mistake here.
Context: ...mplete list). - [Aiven for Apache Kafka] - [Amazon Managed Streaming for Apache Kafk...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...Managed Streaming for Apache Kafka (MSK)] - [Apache Kafka on Azure] - [Azure Event Hu...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...he Kafka (MSK)] - [Apache Kafka on Azure] - [Azure Event Hubs for Apache Kafka] - [Co...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ...re] - [Azure Event Hubs for Apache Kafka] - [Confluent Cloud] - [DoubleCloud Managed ...

(QB_NEW_EN)


[grammar] ~19-~19: There might be a mistake here.
Context: ...ubs for Apache Kafka] - [Confluent Cloud] - [DoubleCloud Managed Service for Apache K...

(QB_NEW_EN)

docs/integrate/kafka/docker-python.md

[grammar] ~1-~1: There might be a mistake here.
Context: (kafka-docker-python)= # Using Kafka with Python This walkthroug...

(QB_NEW_EN)


[grammar] ~4-~4: There might be a mistake here.
Context: ...w to load data from a Kafka topic into a CrateDB table, using a Python consumer a...

(QB_NEW_EN)


[grammar] ~50-~50: There might be a mistake here.
Context: ... CrateDB Admin UI: http://localhost:4200 * Kafka broker (inside-compose hostname): ...

(QB_NEW_EN)

docs/integrate/kafka/index.md

[grammar] ~20-~20: There might be a mistake here.
Context: ...fix ``` :::{include} /_include/links.md ::: :::{div} sd-text-muted Apache Kafka...

(QB_NEW_EN)


[grammar] ~23-~23: There might be a mistake here.
Context: ...ude/links.md ::: :::{div} sd-text-muted Apache Kafka is a widely used open-sourc...

(QB_NEW_EN)


[grammar] ~24-~24: There might be a mistake here.
Context: ...uted event-store and streaming platform. ::: ## Overview [Apache Kafka] is a d...

(QB_NEW_EN)


[grammar] ~35-~35: There might be a mistake here.
Context: ...s, consumers, and CrateDB independently. * Near-real-time analytics – New events ...

(QB_NEW_EN)


[grammar] ~36-~36: There might be a mistake here.
Context: ...n, exposed via SQL to standard BI tools. * Operational resilience – Use Kafka as ...

(QB_NEW_EN)


[grammar] ~73-~73: There might be a mistake here.
Context: ...ts directly from queries over live data. * **ETL / ELT landing zone for downstream an...

(QB_NEW_EN)


[grammar] ~82-~82: There might be a mistake here.
Context: ...anage scaling, upgrades, and monitoring. * Managed Kafka services – Use a provide...

(QB_NEW_EN)


[grammar] ~84-~84: There might be a mistake here.
Context: ...providers as well. * Managed CrateDB – Crate.io offers CrateDB Cloud, which can pair...

(QB_NEW_EN)


[grammar] ~91-~91: There might be a mistake here.
Context: ...vice_id) to keep ordering where needed. * Table modelling in CrateDB – Choose pr...

(QB_NEW_EN)


[grammar] ~92-~92: There might be a mistake here.
Context: ...formance - define shard count per table. * Upserts & deduplication – Include a st...

(QB_NEW_EN)


[grammar] ~93-~93: There might be a mistake here.
Context: ...to make writes idempotent when possible. * Batching & back-pressure – Tune sink b...

(QB_NEW_EN)


[grammar] ~94-~94: There might be a mistake here.
Context: ...gest capacity while keeping latency low. * Schema & types – Keep payloads consist...

(QB_NEW_EN)


[grammar] ~95-~95: There might be a mistake here.
Context: ...imestamps/time zones, numerics, arrays). * Retention & replay – Kafka retention d...

(QB_NEW_EN)


[grammar] ~96-~96: There might be a mistake here.
Context: ...plan storage and compaction accordingly. * Observability – Monitor producer lag, ...

(QB_NEW_EN)


[grammar] ~101-~101: There might be a mistake here.
Context: ...ry latency. ## Learn more ::::{grid} 2 :gutter: 2 :::{grid-item-card} Tutorial...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/integrate/kafka/docker-python.md

119-119: Bare URL used

(MD034, no-bare-urls)


125-125: Bare URL used

(MD034, no-bare-urls)


144-144: Link and image reference definitions should be needed
Unused link or image reference definition: "cratedb and apache kafka"

(MD053, link-image-reference-definitions)

🔇 Additional comments (4)
docs/integrate/kafka/kafka-connect.md (1)

2-2: Title update aligns with content—good change.

Renaming to “Using Confluent Kafka Connect” better reflects the page scope and matches the new index card. Please ensure any inbound links or references in the repo that still use the old title are redirected or updated.

docs/integrate/kafka/attic.md (1)

1-3: Front-matter “orphan: true” is appropriate here.

This keeps the page out of nav while remaining linkable. LGTM.

docs/integrate/kafka/index.md (2)

14-15: Nice addition: CI badge for Ingestr + Kafka.

This adds useful social proof and a quick status check. LGTM.


101-109: Good cross-link: Docker+Python tutorial is clearly surfaced.

Card copy mirrors the tutorial opening, and the :link-type: ref matches the page anchor. Once you finalize the tutorial ref name (see suggestion in that file), remember to update this link if it changes.

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
docs/integrate/kafka/docker-python.md (1)

19-21: Fix Kafka advertised listeners to support host‐based Python clients

The current Compose setup advertises PLAINTEXT://kafka:9092, which works inside Docker but breaks a host‐based consumer connecting to localhost:9092 (it’ll still receive “kafka:9092” in metadata and be unable to resolve that host). We need dual listeners (one internal, one host-accessible) or align advertised listeners with the host endpoint.

Affected locations:

  • docs/integrate/kafka/docker-python.md, lines 19–21: Kafka environment variables
  • docs/integrate/kafka/docker-python.md, lines 27–28: port mappings
  • The Python consumer snippet (bootstrap.servers) around line 104 in the same file

Proposed minimal diff:

 services:
   kafka:
     environment:
-      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
-      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092
-      - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
+      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,PLAINTEXT_HOST://:9094,CONTROLLER://:9093
+      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092,PLAINTEXT_HOST://localhost:9094
+      - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT,CONTROLLER:PLAINTEXT

 services:
   kafka:
     ports:
-      - "9092:9092"
+      - "9092:9092"
+      - "9094:9094"   # expose the host listener for external clients

Also, ensure the Python consumer’s bootstrap setting matches your choice:

  • For dual listeners: keep "bootstrap.servers": "localhost:9092" (host listener on 9094 or remap to 9092)
  • Or run the consumer inside the demo network and use "bootstrap.servers": "kafka:9092" (single listener)

Update the code sample at line 104 accordingly.

🧹 Nitpick comments (14)
docs/integrate/kafka/attic.md (3)

11-21: Clarify that Event Hubs is Kafka API–compatible, not Kafka itself.

Readers often assume these are interchangeable. “Azure Event Hubs for Apache Kafka” exposes a Kafka-compatible endpoint but is not an Apache Kafka cluster; feature parity and operational semantics differ (e.g., partitions, offsets, retention, tooling). Add a one‑line qualifier in this list or a short note in the dropdown to prevent confusion.

Example minimal tweak:

- - [Azure Event Hubs for Apache Kafka]
+ - [Azure Event Hubs for Apache Kafka] (Kafka protocol–compatible service, not Apache Kafka)

12-14: Add staleness disclaimer for third‑party managed-offerings link.

The “overview of managed Kafka offerings” blog post may age quickly. Add a brief “not exhaustive / may be outdated” note to set expectations.

- Several companies provide managed Kafka services (see the [overview of managed Kafka offerings]
- for a more complete list).
+ Several companies provide managed Kafka services (see the [overview of managed Kafka offerings]
+ for examples; note that offerings and features change frequently).

15-20: Name accuracy: “Apache Kafka on Azure”.

The link points to a Canonical marketplace listing. Consider renaming the bullet to reflect that it’s a marketplace/offered image rather than an Azure‑native managed Kafka.

- - [Apache Kafka on Azure]
+ - [Apache Kafka on Azure (Marketplace image)]
docs/integrate/kafka/docker-python.md (5)

14-14: Pin Docker image versions for reproducibility.

Using :latest can introduce breaking changes over time. Pin known‑good tags (Kafka/CrateDB versions used in your CI or tested locally).

-    image: bitnami/kafka:latest
+    image: bitnami/kafka:3.7.0
@@
-    image: crate:latest
+    image: crate:5.7.0

If you prefer floating minors, document the expected minimum versions instead.

Also applies to: 32-32


55-55: Terminology: “Admin UI” instead of “cloud UI”.

http://localhost:4200 is the local Admin UI. Replace “cloud UI” to avoid confusion with CrateDB Cloud.

-The easiest way to do this is through the CrateDB cloud UI at `http://localhost:4200` and execute this using the console:
+The easiest way to do this is through the CrateDB Admin UI at `http://localhost:4200` and execute this using the console:

121-140: Optional: reduce frequent small flushes on idle polls.

Flushing on every poll() timeout can send undersized batches during lulls. Consider a time‑based flush (e.g., every N seconds) or a larger poll() timeout to improve throughput and reduce HTTP overhead.
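
A possible shape for that time-based flush, reusing the tutorial's consumer `c` and `insert_batch()` helper from the diff above; the five-second interval and the `parse()` helper are illustrative:

```python
import time

FLUSH_INTERVAL = 5.0  # seconds; illustrative choice
last_flush = time.monotonic()
batch = []

while True:
    msg = c.poll(1.0)
    if msg is not None and not msg.error():
        batch.append(parse(msg))  # parse() stands in for the tutorial's row mapping
    elapsed = time.monotonic() - last_flush
    # Flush on either the size cap or the elapsed-time cap, never on empty batches.
    if batch and (len(batch) >= 500 or elapsed >= FLUSH_INTERVAL):
        insert_batch(batch)
        batch.clear()
        last_flush = time.monotonic()
```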


146-148: Note on installing confluent-kafka.

On some platforms confluent-kafka needs librdkafka or a matching manylinux wheel. Add a short note with troubleshooting or suggest a containerized run of the consumer for portability.

Happy to add a minimal Dockerfile or a pip constraints snippet that’s known to work cross‑platform.


115-117: Add minimum CrateDB version for ON CONFLICT upsert

CrateDB’s INSERT … ON CONFLICT … DO UPDATE SET … EXCLUDED.* syntax was introduced in CrateDB 3.0.0 (released May 16, 2018) (cratedb.com). To ensure this example works as shown, please add a version note to the tutorial prerequisites, for example:

• In docs/integrate/kafka/docker-python.md (around the prerequisites section)

- Prerequisites:
-  - CrateDB cluster
+ Prerequisites:
+  - CrateDB cluster (version 3.0.0 or later)
docs/integrate/kafka/index.md (6)

34-38: Avoid “exactly-once semantics” phrasing here.

Idempotent upserts mitigate duplicates but don’t by themselves provide end‑to‑end exactly‑once across Kafka → CrateDB (especially since CrateDB doesn’t support transactions). Rephrase to set accurate expectations.

-* **Operational resilience** – Use Kafka as a durable buffer between CrateDB and data producers. Idempotent upserts (exactly-once semantics) reduce data-loss and duplication risks.
+* **Operational resilience** – Use Kafka as a durable buffer between producers and CrateDB. Idempotent upserts reduce duplication risks and improve recovery from retries.

41-43: Qualify PostgreSQL connector compatibility and link constraints.

“Largely compatible” is true for many cases, but specifics (no transactions, type mappings, DDL features) matter. Add a short bullet list of caveats or link to a compatibility page. This reduces surprises when retries/rollbacks occur.

I can draft a concise “Compatibility notes with PostgreSQL connectors” section and cross‑link it here.


23-25: Remove duplicated tagline to reduce repetition.

This muted paragraph repeats the page description above. Consider removing for brevity.

-:::{div} sd-text-muted
-Apache Kafka is a widely used open-source distributed event-store and streaming platform.
-:::
+<!-- Removed duplicated tagline -->

82-86: Clarify managed services wording.

Slightly tune the managed‑services bullet to note that some offerings are Kafka protocol–compatible (e.g., Event Hubs) rather than Kafka itself; link to the attic page if you want a longer list.

-* **Managed Kafka services** – Use a provider such as Confluent Cloud or AWS MSK to offload the operational heavy lifting of Kafka. You can still connect these managed clusters directly to a CrateDB deployment that you operate. CrateDB is also available on the major cloud providers as well.
+* **Managed Kafka services** – Use a provider such as Confluent Cloud or AWS MSK to offload Kafka operations. Some services (e.g., Azure Event Hubs) provide Kafka‑compatible endpoints rather than Kafka itself. Any of these can connect to a CrateDB deployment you operate or to CrateDB Cloud.

132-136: Optional: consider adding the managed offerings attic to the toctree (hidden).

If you expect to reference it elsewhere, adding it here (even hidden) makes discovery in search easier.

 ```{toctree}
 :hidden:
 docker-python
 kafka-connect
+attic

```

6-8: Anchor and naming alignment (minor).

Per the doc restructuring pattern used elsewhere, tutorials sometimes follow a “-learn” convention. No action required, but consider whether anchors like (kafka-docker-python) should follow that pattern for consistency.

Would you like me to propose a consistent anchor/name scheme across integration tutorials?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c4c1313917079286fe724edf3358ee1fb72f59fa and 9956eb608746d17ba8f6b21b1922223636132526.

📒 Files selected for processing (4)
  • docs/integrate/kafka/attic.md (1 hunks)
  • docs/integrate/kafka/docker-python.md (1 hunks)
  • docs/integrate/kafka/index.md (3 hunks)
  • docs/integrate/kafka/kafka-connect.md (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/integrate/kafka/kafka-connect.md

🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-09T16:19:43.966Z
Learnt from: amotl
PR: crate/cratedb-guide#238
File: docs/integrate/azure-functions/learn.rst:1-1
Timestamp: 2025-08-09T16:19:43.966Z
Learning: In the CrateDB Guide documentation, main integration anchors (e.g., `azure-functions`) are intentionally placed in the `index.md` files of their respective integration folders, while detailed tutorials use the `-learn` suffix (e.g., `azure-functions-learn`) in their `learn.rst` or `learn.md` files. This is a deliberate architectural pattern for the documentation restructuring.

Applied to files:

  • docs/integrate/kafka/index.md


🔇 Additional comments (2)
docs/integrate/kafka/docker-python.md (1)

59-59: Clarify TIMESTAMPTZ alias support (CrateDB ≥ 4.0)

CrateDB recognizes `TIMESTAMPTZ` as an alias for `TIMESTAMP WITH TIME ZONE` since version 4.0 (though it was deprecated in 4.x and may change in future releases; see https://cratedb.com/docs/crate/reference/en/4.8/general/ddl/data-types.html), and continues to do so in 5.x (https://cratedb.com/docs/crate/reference/en/5.9/general/ddl/data-types.html).

• File: docs/integrate/kafka/docker-python.md (line 59)  
• Recommend using the explicit type for portability, or, if you keep the alias, note the minimum version.

```diff
-    device_id TEXT, ts TIMESTAMPTZ,
+    device_id TEXT, ts TIMESTAMP WITH TIME ZONE,
```

Optional note if retaining TIMESTAMPTZ:

Note: Requires CrateDB ≥ 4.0 for TIMESTAMPTZ support.

docs/integrate/kafka/index.md (1)

101-110: Nice cross-link to the Python tutorial.

The Learn card aligns with the new tutorial and uses consistent wording with the tutorial’s intro. Looks good.

@amotl requested review from bmunkholm, kneth and surister August 22, 2025 18:11
@amotl marked this pull request as ready for review August 22, 2025 18:11
@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/conf.py (1)

39-41: Consider removing the old repo entry or consolidating both patterns

You now ignore both crate-workbench and crate URLs. If the intention was to replace the former, drop the old entry; otherwise, consider a single regex to cover both for maintainability.

Option A — remove the outdated entry:

-    r"https://github.com/crate-workbench/cratedb-toolkit/actions/runs/.*",
     r"https://github.com/crate/cratedb-toolkit/actions/runs/.*",

Option B — consolidate into one pattern:

-    r"https://github.com/crate-workbench/cratedb-toolkit/actions/runs/.*",
-    r"https://github.com/crate/cratedb-toolkit/actions/runs/.*",
+    r"https://github.com/crate(?:-workbench)?/cratedb-toolkit/actions/runs/.*",
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9956eb6 and 9956140.

📒 Files selected for processing (1)
  • docs/conf.py (1 hunks)

@amotl added the labels "new content" (New content being added.) and "guidance" (Matters of layout, shape, and structure.) Aug 24, 2025

### Create a demo table in CrateDB

The easiest way to do this is through the CrateDB cloud UI at `http://localhost:4200` and execute this using the console:
Member

Suggested change
The easiest way to do this is through the CrateDB cloud UI at `http://localhost:4200` and execute this using the console:
The easiest way to do this is through the CrateDB Admin UI at `http://localhost:4200` and execute this using the console:

Member Author

All your suggestions have been implemented with 15e80bf, thanks!

);
```

But this can also be done using curl:
Member

Suggested change
But this can also be done using curl:
But this can also be done using `curl`:
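
For reference, the same DDL can also be submitted from Python. A minimal sketch, assuming a local CrateDB on port 4200; the column list is illustrative, since the tutorial's actual CREATE TABLE statement is elided above:

```python
import requests

# Illustrative DDL; substitute the tutorial's actual CREATE TABLE statement.
stmt = (
    "CREATE TABLE IF NOT EXISTS sensors "
    "(device_id TEXT, ts TIMESTAMPTZ, value DOUBLE PRECISION)"
)
resp = requests.post("http://localhost:4200/_sql", json={"stmt": stmt}, timeout=10)
resp.raise_for_status()
```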


### Create a Kafka topic and send a couple of messages

This can be done in several ways, but we can use **docker-exec** in this way:
Member

Suggested change
This can be done in several ways, but we can use **docker-exec** in this way:
Creating a Kafka topic can be done in several ways, but we can use **docker-exec** in this way:
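
As an aside, the topic can also be created from Python instead of docker exec. A sketch using confluent_kafka's admin API, assuming the broker is reachable on localhost:9092 and reusing the tutorial's topic name:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed host listener
# create_topics() returns a dict mapping topic name -> future
futures = admin.create_topics([NewTopic("sensors", num_partitions=1, replication_factor=1)])
for topic, future in futures.items():
    future.result()  # raises if creation failed
    print(f"Created topic {topic}")
```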


For more advanced pipelines, you can process events while they’re still in Kafka before they ever reach CrateDB. Frameworks like Flink, Kafka Streams, or Spark let you enrich records, join multiple streams together, run aggregations, or apply windowing functions in real time.

The processed results are then written into CrateDB, where they’re immediately available for SQL queries and dashboards. This approach is powerful when raw events need to be cleaned, combined, or summarised before storage, though it adds more moving parts compared to a simple connector.
Member

Suggested change
The processed results are then written into CrateDB, where they’re immediately available for SQL queries and dashboards. This approach is powerful when raw events need to be cleaned, combined, or summarised before storage, though it adds more moving parts compared to a simple connector.
The processed results are then written into CrateDB, where they’re immediately available for SQL queries and dashboards. This approach is powerful when raw events need to be cleaned, combined, or summarised before storing them, though it adds moving parts compared to a simple connector.
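
To make the “process before storing” idea concrete, here is a toy sketch of pre-aggregating events in plain Python before insertion, standing in for what Flink or Kafka Streams would do; the window size and field names are illustrative:

```python
from collections import defaultdict

def aggregate_window(events, window_s=10):
    """Average readings per device over fixed time windows before storage."""
    buckets = defaultdict(list)
    for e in events:  # e.g. {"device_id": "d1", "epoch": 1724700000, "value": 21.5}
        window_start = int(e["epoch"] // window_s) * window_s
        buckets[(e["device_id"], window_start)].append(e["value"])
    return [
        [device, start, sum(vals) / len(vals)]
        for (device, start), vals in buckets.items()
    ]
```

Each returned row can then go through the same bulk-insert path as raw events.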


How you run Kafka and CrateDB depends a lot on your environment and preferences. The most common approaches are:

* **Containerised on-premise** – Run both Kafka and CrateDB on Docker or Kubernetes in your own data centre or private cloud. This gives you the most control, but also means you manage scaling, upgrades, and monitoring.
Member

Suggested change
* **Containerised on-premise** – Run both Kafka and CrateDB on Docker or Kubernetes in your own data centre or private cloud. This gives you the most control, but also means you manage scaling, upgrades, and monitoring.
* **Containerised on-premise** – Run both Kafka and CrateDB on Docker or Kubernetes in your own data centre or private cloud. This gives you the most control, but also means you manage scaling, upgrading, and monitoring.

@amotl merged commit a13bf28 into main Aug 26, 2025
3 checks passed
@amotl deleted the kafka branch August 26, 2025 22:10