From c482c56b2a9cade45aadd8a3c0bed40977cf4b71 Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Sat, 4 Oct 2025 10:29:38 +0700 Subject: [PATCH 01/10] docs(self-hosted): provide more insights on troubleshooting kafka Turns out most of our self-hosted users have never touched Kafka before, so it's a good idea to introduce them to how Kafka works. Also added how to increase consumer replicas if they're lagging behind. --- .../self-hosted/troubleshooting/kafka.mdx | 73 +++++++++++++++---- 1 file changed, 58 insertions(+), 15 deletions(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index 1cfec5a00f7eff..ee6192d931e4cd 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -4,19 +4,19 @@ sidebar_title: Kafka sidebar_order: 2 --- -## Offset Out Of Range Error +## How Kafka Works -```log -Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"} -``` +This section is aimed for those who has Kafka problems, yet not familiar with Kafka. At a high level, it is a message broker which stores message in a log (or in an easier language: very similar to array) format. It receives messages from producers that aimed to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages. -This happens where Kafka and the consumers get out of sync. Possible reasons are: +On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine. -1. Running out of disk space or memory -2. 
Having a sustained event spike that causes very long processing times, causing Kafka to drop messages as they go past the retention time -3. Date/time out of sync issues due to a restart or suspend/resume cycle +When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that the number of consumers must not exceed the number of partition for a given topic. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume. + +Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers. -### Visualize +One difference from other queues or brokers like RabbitMQ or Redis is that Kafka has a concept called "retention time". Messages that are stored on Kafka and consumed by consumers won't be deleted immediately. Instead, they will be stored for a certain period of time. By default, self-hosted Sentry uses Kafka with a retention time of 24 hours. This means that messages that are older than 24 hours will be deleted. If you want to change the retention time, you can do so by modifying the `KAFKA_LOG_RETENTION_HOURS` environment variable in the `kafka` service. 
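To make the partition, offset, and lag terminology above concrete, here is a small illustrative model in plain Python. This is a toy sketch under stated assumptions, not the real Kafka client API; every name in it is made up for illustration.

```python
# Toy model of a single Kafka partition -- NOT the real Kafka API.
# The "log" is an append-only list; an offset is simply an index into it.

class Partition:
    def __init__(self):
        self.log = []  # messages, in append order

    def append(self, message):
        self.log.append(message)

    @property
    def log_end_offset(self):
        # The offset that the *next* produced message will receive.
        return len(self.log)

class Consumer:
    def __init__(self, partition):
        self.partition = partition
        self.committed_offset = 0  # next offset this consumer will read

    def poll(self):
        # Consume one message if any are available, advancing the offset.
        if self.committed_offset < self.partition.log_end_offset:
            message = self.partition.log[self.committed_offset]
            self.committed_offset += 1
            return message
        return None

    @property
    def lag(self):
        # Unprocessed messages: log end minus the consumer's position.
        return self.partition.log_end_offset - self.committed_offset

p = Partition()
for i in range(5):
    p.append(f"event-{i}")

c = Consumer(p)
c.poll()  # consumes "event-0"
c.poll()  # consumes "event-1"
print(c.lag)  # 3 messages still unprocessed
```

A producer that outpaces the consumer simply grows `log_end_offset` faster than `committed_offset` advances, which is exactly what the "lag" column in the tooling below reports.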
+ +### Visualize Kafka You can visualize the Kafka consumers and their offsets by bringing an additional container, such as [Kafka UI](https://github.com/provectus/kafka-ui) or [Redpanda Console](https://github.com/redpanda-data/console) into your Docker Compose. @@ -59,6 +59,20 @@ redpanda-console: - kafka ``` +It's recommended to put this in `docker-compose.override.yml` rather than modifying your `docker-compose.yml` directly. The UI can then be accessed at `http://localhost:8080/` (or `http://:8080/` if you're using a reverse proxy). + +## Offset Out Of Range Error + +```log +Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"} +``` + +This happens when Kafka and the consumers get out of sync. Possible reasons are: + +1. Running out of disk space or memory +2. Having a sustained event spike that causes very long processing times, causing Kafka to drop messages as they go past the retention time +3. Date/time out of sync issues due to a restart or suspend/resume cycle + Ideally, you want to have zero lag for all consumer groups. If a consumer group has a lot of lag, you need to investigate whether it's caused by a disconnected consumer (e.g., a Sentry/Snuba container that's disconnected from Kafka) or a consumer that's stuck processing a certain message. If it's a disconnected consumer, you can either restart the container or reset the Kafka offset to 'earliest.' Otherwise, you can reset the Kafka offset to 'latest.' ### Recovery @@ -77,19 +91,19 @@ The _proper_ solution is as follows ([reported](https://github.com/getsentry/sel ``` 2. List the consumer groups: ```shell - docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list + docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list ``` 3. 
Get group info: ```shell - docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --describe + docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --describe ``` 4. Watching what is going to happen with offset by using dry-run (optional): ```shell - docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --dry-run + docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --dry-run ``` 5. Set offset to latest and execute: ```shell - docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --execute + docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --execute ``` 6. Start the previously stopped Sentry/Snuba containers: ```shell @@ -107,14 +121,16 @@ This option is as follows ([reported](https://github.com/getsentry/self-hosted/i 1. Set offset to latest and execute: ```shell - docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute + docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute ``` Unlike the proper solution, this involves resetting the offsets of all consumer groups and all topics. #### Nuclear option -The _nuclear option_ is removing all Kafka-related volumes and recreating them which _will_ cause data loss. Any data that was pending there will be gone upon deleting these volumes. + + The _nuclear option_ is removing all Kafka-related volumes and recreating them which _will_ cause data loss. 
Any data that was pending there will be gone upon deleting these volumes. + 1. Stop the instance: ```shell docker compose down ``` @@ -133,6 +149,33 @@ The _nuclear option_ is removing all Kafka-related volumes and recreating them w ```shell docker compose up --wait ``` + +## Consumers Lagging Behind + +If you notice a very slow ingestion speed and consumers are lagging behind, it's likely that the consumers are not able to keep up with the rate of messages being produced. To fix this, you can increase the number of partitions and the number of consumers. + +1. For example, if you see that the `ingest-consumer` consumer group has a lot of lag, and you can see that it's subscribed to the `ingest-events` topic, then you need to first increase the number of partitions for that topic. + ```bash + docker compose exec kafka kafka-topics --bootstrap-server kafka:9092 --alter --partitions 3 --topic ingest-events + ``` +2. Validate that the number of partitions for the topic is now 3. + ```bash + docker compose exec kafka kafka-topics --bootstrap-server kafka:9092 --describe --topic ingest-events + ``` +3. Then, you need to increase the number of consumers for the consumer group. You can see in `docker-compose.yml` that the container consuming the `ingest-events` topic with the `ingest-consumer` consumer group is the `events-consumer` container. We won't modify `docker-compose.yml` directly; instead, we will create a new file called `docker-compose.override.yml` and add the following: + ```yaml + services: + events-consumer: + deploy: + replicas: 3 + ``` + This will increase the number of consumers for the `ingest-consumer` consumer group to 3. +4. Finally, you need to refresh the `events-consumer` container. You can do so by running the following command: + ```bash + docker compose up -d --wait events-consumer + ``` +5. Observe the logs of `events-consumer`; you should not see any consumer errors. 
Let it run for a while (usually a few minutes to a few hours) and observe the Kafka topic lags. + ## Reducing disk usage If you want to reduce the disk space used by Kafka, you'll need to carefully calculate how much data you are ingesting, how much data loss you can tolerate, and then follow the recommendations on [this awesome StackOverflow post](https://stackoverflow.com/a/52970982/90297) or [this post on our community forum](https://forum.sentry.io/t/sentry-disk-cleanup-kafka/11337/2?u=byk). From 373de2c86680ba37de8881693d0abf633faa3c0a Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Sat, 4 Oct 2025 20:37:41 +0700 Subject: [PATCH 02/10] Update kafka.mdx Co-authored-by: Kevin Pfeifer --- develop-docs/self-hosted/troubleshooting/kafka.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index ee6192d931e4cd..dd4e7ab575740e 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -6,7 +6,7 @@ sidebar_order: 2 ## How Kafka Works -This section is aimed for those who has Kafka problems, yet not familiar with Kafka. At a high level, it is a message broker which stores message in a log (or in an easier language: very similar to array) format. It receives messages from producers that aimed to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages. +This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, it is a message broker which stores message in a log (or in an easier language: very similar to an array) format. It receives messages from producers that aimed to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages. 
On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine. From b2803602dbbfc697bb50a21c287e39afc9719af3 Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Sat, 4 Oct 2025 20:38:00 +0700 Subject: [PATCH 03/10] Update kafka.mdx Co-authored-by: Kevin Pfeifer --- develop-docs/self-hosted/troubleshooting/kafka.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index dd4e7ab575740e..3a04ccb699dc2b 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -10,7 +10,7 @@ This section is aimed for those who have Kafka problems, but are not yet familia On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine. -When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. 
One very important aspect to note is that the number of consumers must not exceed the number of partition for a given topic. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume. +When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume. Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers. 
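The consumer-to-partition rule bolded above can be sketched in a few lines of plain Python. This is an illustrative round-robin assignment only — Kafka's real partition assignors are more sophisticated, and every name here is hypothetical:

```python
# Illustrative sketch of why extra consumers in a group sit idle.
# Simplified round-robin assignment -- NOT Kafka's actual assignor.

def assign(partitions, consumers):
    """Spread partitions across consumers round-robin; returns {consumer: [partitions]}."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        consumer = consumers[index % len(consumers)]
        assignment[consumer].append(partition)
    return assignment

partitions = [0, 1, 2]  # a topic with 3 partitions

# 3 partitions, 4 consumers in the same group: consumer "d" gets nothing.
print(assign(partitions, ["a", "b", "c", "d"]))
# {'a': [0], 'b': [1], 'c': [2], 'd': []}
```

However partitions are distributed, with fewer partitions than group members at least one member ends up with an empty assignment — which is why adding consumers beyond the partition count does not reduce lag.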
From 13365e070bb164d9eb0d830cebd51241ede152d8 Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Tue, 7 Oct 2025 17:17:04 +0700 Subject: [PATCH 04/10] Apply suggestion from @jjbayer Co-authored-by: Joris Bayer --- develop-docs/self-hosted/troubleshooting/kafka.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index 3a04ccb699dc2b..78e95be83e5c03 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -6,7 +6,7 @@ sidebar_order: 2 ## How Kafka Works -This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, it is a message broker which stores message in a log (or in an easier language: very similar to an array) format. It receives messages from producers that aimed to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages. +This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, Kafka is a message broker which stores messages in a log (or in an easier language: very similar to an array) format. It receives messages from producers that write to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages. On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine. 
From 484e104ef2f850c2c8d488cfded1f86e45bdad8b Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Tue, 7 Oct 2025 17:17:12 +0700 Subject: [PATCH 05/10] Apply suggestion from @jjbayer Co-authored-by: Joris Bayer --- develop-docs/self-hosted/troubleshooting/kafka.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index 78e95be83e5c03..b81268b258e77c 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -10,7 +10,7 @@ This section is aimed for those who have Kafka problems, but are not yet familia On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine. -When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume. +When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. 
A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume. Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers. From 669bc82b41635d72424585bad133fa0c569259df Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Wed, 8 Oct 2025 07:09:29 +0700 Subject: [PATCH 06/10] Update develop-docs/self-hosted/troubleshooting/kafka.mdx Co-authored-by: Shannon Anahata --- develop-docs/self-hosted/troubleshooting/kafka.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index b81268b258e77c..8995a0b9998c50 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -6,7 +6,7 @@ sidebar_order: 2 ## How Kafka Works -This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, Kafka is a message broker which stores messages in a log (or in an easier language: very similar to an array) format. 
It receives messages from producers that write to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages. +This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, Kafka is a message broker which stores messages in a log, very similar to an array, format. It receives messages from producers that write to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages. On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine. From b23c4325aef943871f637fdb11f4512d17ae3f60 Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Fri, 31 Oct 2025 15:52:51 +0700 Subject: [PATCH 07/10] Update develop-docs/self-hosted/troubleshooting/kafka.mdx Co-authored-by: Shannon Anahata --- develop-docs/self-hosted/troubleshooting/kafka.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index 8995a0b9998c50..438eb64ea6431e 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -8,7 +8,7 @@ sidebar_order: 2 This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, Kafka is a message broker which stores messages in a log, very similar to an array, format. It receives messages from producers that write to a specific topic, and then sends them to consumers that are subscribed to that topic. 
The consumers can then process the messages. -On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine. +On the inside, when a message enters a topic, it will be written to a certain partition. You can think of a partition as a physical box that stores messages for a specific topic. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine. When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume. 
From fd32f39364fc0766b255c0117eddcca28747b676 Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Fri, 31 Oct 2025 15:53:05 +0700 Subject: [PATCH 08/10] Update develop-docs/self-hosted/troubleshooting/kafka.mdx Co-authored-by: Shannon Anahata --- develop-docs/self-hosted/troubleshooting/kafka.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index 438eb64ea6431e..15dfc5f31d5504 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -12,7 +12,7 @@ On the inside, when a message enters a topic, it will be written to a certain pa When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume. -Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers. +Each message in a topic will have an "offset" (number). 
You can think of this like an "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the time, we want "lag" to be as low as possible. The easiest solution to lagging is adding more partitions and increasing the number of consumers. One difference from other queues or brokers like RabbitMQ or Redis is that Kafka has a concept called "retention time". Messages that are stored on Kafka and consumed by consumers won't be deleted immediately. Instead, they will be stored for a certain period of time. By default, self-hosted Sentry uses Kafka with a retention time of 24 hours. This means that messages that are older than 24 hours will be deleted. If you want to change the retention time, you can do so by modifying the `KAFKA_LOG_RETENTION_HOURS` environment variable in the `kafka` service. From 19a0041d7f3e6374081649c7934cbbcfe7a02ff7 Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Fri, 31 Oct 2025 15:53:36 +0700 Subject: [PATCH 09/10] Update develop-docs/self-hosted/troubleshooting/kafka.mdx Co-authored-by: Shannon Anahata --- develop-docs/self-hosted/troubleshooting/kafka.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index 15dfc5f31d5504..34ea0bead96248 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -10,7 +10,7 @@ This section is aimed for those who have Kafka problems, but are not yet familia On the inside, when a message enters a topic, it will be written to a certain partition. You can think of a partition as a physical box that stores messages for a specific topic. 
In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine. -When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume. +When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. **Important to note: the number of consumers cannot exceed the number of partitions**. If you have more consumers than partitions, the extra consumers will receive no messages. Each message in a topic will have an "offset" (number). You can think of this like an "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the time, we want "lag" to be as low as possible. The easiest solution to lagging is adding more partitions and increasing the number of consumers. From 7842c72e83cba3efe43e372cab9ef1f6a79d983a Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Sun, 2 Nov 2025 09:50:25 +0700 Subject: [PATCH 10/10] docs(self-hosted): Kafka troubleshooting from review comments --- develop-docs/self-hosted/troubleshooting/kafka.mdx | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index 34ea0bead96248..315b9264c01139 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -10,9 +10,9 @@ This section is aimed for those who have Kafka problems, but are not yet familia On the inside, when a message enters a topic, it will be written to a certain partition. You can think of a partition as a physical box that stores messages for a specific topic. 
From 7842c72e83cba3efe43e372cab9ef1f6a79d983a Mon Sep 17 00:00:00 2001 From: Reinaldy Rafli Date: Sun, 2 Nov 2025 09:50:25 +0700 Subject: [PATCH 10/10] docs(self-hosted): Kafka troubleshooting from review comments --- develop-docs/self-hosted/troubleshooting/kafka.mdx | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/develop-docs/self-hosted/troubleshooting/kafka.mdx b/develop-docs/self-hosted/troubleshooting/kafka.mdx index 34ea0bead96248..315b9264c01139 100644 --- a/develop-docs/self-hosted/troubleshooting/kafka.mdx +++ b/develop-docs/self-hosted/troubleshooting/kafka.mdx @@ -10,9 +10,9 @@ This section is aimed for those who have Kafka problems, but are not yet familia On the inside, when a message enters a topic, it will be written to a certain partition. You can think of a partition as physical a box that stores messages for a specific topic. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine. -When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. **Important to note: the number of consumers cannot exceed the number of partitions**. If you have more consumers than partitions, the extra consumers will receive no messages. +When a producer sends a message to a topic, it will either stick to a certain partition number based on the partition key (example: partition 1, partition 2, etc.) or it will choose a partition in a round-robin manner. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. 
The consumer will then start receiving messages from the assigned partitions. **Important to note: the number of consumers cannot exceed the number of partitions**. If you have more consumers than partitions, the extra consumers will receive no messages. -Each message in a topic will have an "offset" (number). You can think of this like an "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the time, we want "lag" to be as low as possible. The easiest solution to lagging is adding more partitions and increasing the number of consumers. +Each message in a topic will have an "offset" (number). You can think of this like an "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. Offsets are scoped to a partition, so messages in different partitions of the same topic can have the same offset numbers. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the time, we want "lag" to be as low as possible. The easiest solution to lagging is adding more partitions and increasing the number of consumers. One difference from other queues or brokers like RabbitMQ or Redis is that Kafka has a concept called "retention time". Messages that are stored on Kafka and consumed by consumers won't be deleted immediately. Instead, they will be stored for a certain period of time. By default, self-hosted Sentry uses Kafka with a retention time of 24 hours. This means that messages that are older than 24 hours will be deleted. If you want to change the retention time, you can do so by modifying the `KAFKA_LOG_RETENTION_HOURS` environment variable in the `kafka` service. @@ -75,6 +75,10 @@ This happens when Kafka and the consumers get out of sync. 
Possible reasons are Ideally, you want to have zero lag for all consumer groups. If a consumer group has a lot of lag, you need to investigate whether it's caused by a disconnected consumer (e.g., a Sentry/Snuba container that's disconnected from Kafka) or a consumer that's stuck processing a certain message. If it's a disconnected consumer, you can either restart the container or reset the Kafka offset to 'earliest.' Otherwise, you can reset the Kafka offset to 'latest.' + +Choose "earliest" if you want to start re-processing events from the beginning. Choose "latest" if you are okay with losing old events and want to start processing from the newest events. + + ### Recovery @@ -176,6 +180,10 @@ If you notice a very slow ingestion speed and consumers are lagging behind, it's ``` 5. Observe the logs of `events-consumer`; you should not see any consumer errors. Let it run for a while (usually a few minutes to a few hours) and observe the Kafka topic lags. + +The definition of "normal lag" varies depending on your system resources. If you are running a small instance, you can expect normal lag in the hundreds of messages. If you are running a large instance, you can expect normal lag in the thousands of messages. + + ## Reducing disk usage If you want to reduce the disk space used by Kafka, you'll need to carefully calculate how much data you are ingesting, how much data loss you can tolerate, and then follow the recommendations on [this awesome StackOverflow post](https://stackoverflow.com/a/52970982/90297) or [this post on our community forum](https://forum.sentry.io/t/sentry-disk-cleanup-kafka/11337/2?u=byk).
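As a starting point for that calculation, here is a rough back-of-the-envelope sketch in plain Python. The throughput and message-size numbers below are made-up examples for illustration, not measurements from any real instance:

```python
# Rough estimate of Kafka disk usage implied by a retention window.
# All inputs are hypothetical examples -- measure your own ingest rate.

def kafka_disk_estimate_gb(events_per_second, avg_message_kb, retention_hours):
    # Total data retained = rate * window * message size (ignores
    # replication, compression, and per-segment overhead).
    total_kb = events_per_second * 3600 * retention_hours * avg_message_kb
    return total_kb / (1024 * 1024)  # KB -> GB

# e.g. 100 events/s at ~10 KB per event with the default 24h retention
print(round(kafka_disk_estimate_gb(100, 10, 24), 1))  # ~82.4 GB
```

Halving the retention window halves the estimate, which is why shrinking `KAFKA_LOG_RETENTION_HOURS` (or per-topic `retention.ms`/`retention.bytes`, as the linked posts discuss) is the main lever for reclaiming disk space.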