
KafkaConsumer: back pressure + improved read speed #139

Merged (5 commits) on Nov 1, 2023

Conversation

felixschlegel (Contributor):

Motivation:

Closes #131.

Example:

I ran an example consumer reading a topic with approx. 11_000_000 messages:

var consumerConfig = KafkaConsumerConfiguration(
    consumptionStrategy: .partition(
        KafkaPartition(rawValue: 0),
        topic: "test-topic",
        offset: KafkaOffset(rawValue: 0) // Important: Read from beginning!
    ),
    bootstrapBrokerAddresses: [self.bootstrapBrokerAddress]
)
consumerConfig.pollInterval = .zero
consumerConfig.autoOffsetReset = .beginning // Always read topics from beginning
consumerConfig.broker.addressFamily = .v4

let consumer = try KafkaConsumer(
    configuration: consumerConfig,
    logger: .kafkaTest
)

let serviceGroupConfiguration = ServiceGroupConfiguration(services: [consumer], logger: .kafkaTest)
let serviceGroup = ServiceGroup(configuration: serviceGroupConfiguration)

try await withThrowingTaskGroup(of: Void.self) { group in
    // Run Task
    group.addTask {
        try await serviceGroup.run()
    }

    // Consumer Task
    group.addTask {
        var count = 0
        for try await message in consumer.messages {
            _ = message // drop message
            count += 1
            try await Task.sleep(for: .milliseconds(1))
            if count % 1000 == 0 {
                print(count)
            }
        }
    }

    // Wait for Consumer Task to complete
    try await group.next()
    // Shutdown the serviceGroup
    await serviceGroup.triggerGracefulShutdown()
}
Before (without back pressure):

[Screenshot 2023-10-11 at 13 23 53]

After (with back pressure):

[Screenshot 2023-10-11 at 20 48 18]

This result is tolerable since the queued.max.messages.kbytes configuration property defaults to prefetching at most ~65 MB of messages. Exposing queued.max.messages.kbytes will be done in a follow-up PR.

Modifications:

  • re-add KafkaConsumerConfiguration.backPressureStrategy: BackPressureStrategy, currently allowing users to add high-low-watermark backpressure to their KafkaConsumers
  • KafkaConsumer:
    • make KafkaConsumerMessages use NIOAsyncSequenceProducerBackPressureStrategies.HighLowWatermark as its backpressure strategy
    • remove rd_kafka_poll_set_consumer -> use two separate queues for consumer events and consumer messages so we can exert backpressure on the consumer message queue
    • remove the idle polling mechanism where incoming messages were discarded when KafkaConsumerMessages was terminated -> we now have two independent queues
    • rename .pollForAndYieldMessage -> .pollForEventsAndMessages
    • refactor State and add ConsumerMessagesSequenceState
  • KafkaProducer:
    • rename .consumptionStopped -> .eventConsumptionFinished
  • RDKafkaClient:
    • bring back consumerPoll()
    • eventPoll(): only poll the main queue for events since consumer messages are now handled on a different queue

/// See ``KafkaConsumerConfiguration/BackPressureStrategy-swift.struct`` for more information.
public var backPressureStrategy: BackPressureStrategy = .watermark(
    low: 10,
    high: 50
Contributor Author:

Are you happy with these default values?

Collaborator:

I think these values should work well for a wide range of cases.

Contributor:

Fine by me as well. Kafka is all about people tuning the settings to fit their messages.
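For context, a hedged sketch of how a consumer could opt into different values, reusing the consumerConfig from the example in the PR description (backPressureStrategy and .watermark(low:high:) come from the diff excerpt above; the chosen numbers are purely illustrative):

// Assuming `consumerConfig` from the example above; the values are illustrative.
consumerConfig.backPressureStrategy = .watermark(
    low: 50,   // resume polling the message queue once the buffer drains to 50 messages
    high: 100  // stop polling the message queue once 100 messages are buffered
)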

Motivation:

Closes swift-server#131.

Modifications:

* re-add `KafkaConsumerConfiguration.backPressureStrategy:
  BackPressureStrategy`, currently allowing users to add
  high-low-watermark backpressure to their `KafkaConsumer`s
* `KafkaConsumer`:
    * make `KafkaConsumerMessages` use `NIOAsyncSequenceProducerBackPressureStrategies.HighLowWatermark`
      as backpressure strategy
    * remove `rd_kafka_poll_set_consumer` -> use two separate queues for
      consumer events and consumer messages so we can exert backpressure
      on the consumer message queue
    * remove idle polling mechanism where incoming messages were
      discarded when `KafkaConsumerMessages` was terminated -> we now
      have two independent queues
    * rename `.pollForAndYieldMessage` -> `.pollForEventsAndMessages`
    * refactor `State` and add `ConsumerMessagesSequenceState`
* `KafkaProducer`:
    * rename `.consumptionStopped` -> `.eventConsumptionFinished`
* `RDKafkaClient`:
    * bring back `consumerPoll()`
    * `eventPoll()`: only poll the main queue for events since consumer messages are now handled on a different queue

defer {
    // Destroy message otherwise poll() will block forever
    rd_kafka_message_destroy(messagePointer)
Collaborator:

Hm, that is out of scope, but could we perhaps not destroy this message and instead retain it inside KafkaConsumerMessage, thus removing the allocations of ByteBuffer and other structures?

Contributor Author:

Yes, please create a separate issue for that 😄
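A hedged sketch of what that could look like (a hypothetical wrapper type, not part of this PR; the Crdkafka module name is an assumption): keep the librdkafka message alive inside a reference type and destroy it only in deinit, so the payload can be exposed without eagerly copying it into a ByteBuffer.

import Crdkafka // assumed name of the package's librdkafka C module

// Hypothetical wrapper, not the PR's API: owns the librdkafka message and
// gives the memory back to librdkafka only when the wrapper is deallocated.
final class RetainedKafkaMessage {
    private let messagePointer: UnsafeMutablePointer<rd_kafka_message_t>

    init(owning messagePointer: UnsafeMutablePointer<rd_kafka_message_t>) {
        self.messagePointer = messagePointer
    }

    // Zero-copy view of the payload backed by librdkafka's own buffer.
    var payload: UnsafeRawBufferPointer {
        UnsafeRawBufferPointer(
            start: messagePointer.pointee.payload,
            count: Int(messagePointer.pointee.len)
        )
    }

    deinit {
        // Safe to destroy only once nobody can reach the payload anymore.
        rd_kafka_message_destroy(messagePointer)
    }
}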

// Poll for new consumer message.
var result: Result<KafkaConsumerMessage, Error>?
do {
    if let message = try client.consumerPoll() {
Collaborator:

It seems that previously we would poll for up to 100 messages before sleeping, and it looks like now it would be just one message.

Contributor:

If I read this correctly, we are polling a single message and then yielding a single message, right? It would probably be better if we read them in batches of 100 and then yielded all of them at once. That would mean we have to acquire the locks a lot less.
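A minimal, self-contained sketch of that batching idea (the types below are illustrative stand-ins, not the actual swift-kafka-client API): drain up to a fixed number of messages from the local queue, then hand the whole batch to the async sequence in a single yield so its internal lock is acquired once per batch instead of once per message.

struct ConsumerMessage { let value: String }

enum YieldResult { case produceMore, stopProducing }

protocol MessagePoller {
    // Returns the next locally queued message, or nil if the queue is currently empty.
    func consumerPoll() throws -> ConsumerMessage?
}

protocol MessageSource {
    // Yields a whole batch at once; the source's lock is taken only once per call.
    func yield(contentsOf batch: [ConsumerMessage]) -> YieldResult
}

func pollAndYieldBatch(
    poller: some MessagePoller,
    source: some MessageSource,
    maxBatchSize: Int = 100
) throws -> YieldResult {
    var batch: [ConsumerMessage] = []
    batch.reserveCapacity(maxBatchSize)

    // Drain up to maxBatchSize messages from the local librdkafka queue.
    while batch.count < maxBatchSize, let message = try poller.consumerPoll() {
        batch.append(message)
    }

    guard !batch.isEmpty else {
        return .produceMore // nothing was buffered locally, keep polling
    }
    // One yield (and therefore one lock acquisition) for the whole batch.
    return source.yield(contentsOf: batch)
}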

Collaborator:

Not sure that is entirely what I meant. This comment related to the changes before the task group was introduced: there was a sleep after every consumed message.
So it is not relevant now, except probably for this part: https://github.com/swift-server/swift-kafka-client/pull/139/files#r1360663871

/// See ``KafkaConsumerConfiguration/BackPressureStrategy-swift.struct`` for more information.
public var backPressureStrategy: BackPressureStrategy = .watermark(
    low: 10,
    high: 50
Collaborator:

I think these values should work well for a wide range of cases.

/// See ``KafkaConsumerConfiguration/BackPressureStrategy-swift.struct`` for more information.
public var backPressureStrategy: BackPressureStrategy = .watermark(
    low: 10,
    high: 50
Contributor:

Fine by me as well. Kafka is all about people tuning the settings to fit their messages.

// Poll for new consumer message.
var result: Result<KafkaConsumerMessage, Error>?
do {
    if let message = try client.consumerPoll() {
Contributor:

We have to do more here than just calling the two separate polls. We probably want to serve the consumer poll in a child task and make it completely driven by the backpressure of the async sequence. However, we need to keep shutdown in mind, so we need to inform the child task about this.

Saying all of that, I am not yet convinced it is the right solution. The one question I am asking myself right now is how we get notified that the consumer queue has more messages again once we go poll-based and decouple the two queue polls. Is there a callback in rdkafka that we can set once we get nil from consumerPoll so that we can enqueue ourselves?

Collaborator:

I believe there is no such notification in librdkafka. I guess the intended approach with the native API is to use a blocking call for the consumer poll, which contradicts the Swift Concurrency contract.
From the librdkafka sample (https://github.com/confluentinc/librdkafka/blob/master/examples/consumer.c#L209C17-L209C55):

for (;;) {
    rd_kafka_message_t *msg = rd_kafka_consumer_poll(consumer, 100 /* 100ms timeout */);
    if (!msg)
        continue; /* no message within the timeout */
    /* ... handle the message ... */
    rd_kafka_message_destroy(msg);
}

Another thought, after checking the librdkafka documentation, is that all callbacks are invoked from queue polls (https://github.com/confluentinc/librdkafka/blob/master/INTRODUCTION.md#threads-and-callbacks). That, unfortunately, makes callbacks without polls useless.

I guess a dedicated thread for blocking polls is not in line with Swift evolution. However, with Swift Concurrency it is probably okay to sleep/yield the task for some configurable timeout (e.g. 10 ms) when there are no more messages to read, and otherwise keep reading until we bump into backpressure. Additionally, it is possible to make poll intervals adaptive depending on message flow (#128).
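A rough sketch of that idea (the closures stand in for RDKafkaClient.consumerPoll() and the async sequence source's yield; this is not the PR's actual code): keep polling while messages are available, and sleep for a short, configurable interval only once the local queue runs dry or backpressure kicks in.

// pollOne: returns the next message, or nil when the local queue is empty.
// yieldOne: hands a message to the async sequence and returns false once
//           backpressure asks us to stop producing.
func messagePollLoop<Message>(
    pollOne: () throws -> Message?,
    yieldOne: (Message) -> Bool,
    idleSleep: Duration = .milliseconds(10)
) async throws {
    while !Task.isCancelled {
        if let message = try pollOne() {
            guard yieldOne(message) else {
                // Backpressure: back off briefly before trying again.
                try await Task.sleep(for: idleSleep)
                continue
            }
            // A message was available, so poll again immediately without sleeping.
        } else {
            // The local queue is empty: sleep instead of busy-looping.
            try await Task.sleep(for: idleSleep)
        }
    }
}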

Modifications:

* have two state machines:
    1. consumer state itself
    2. state of consumer messages async sequence
@felixschlegel changed the title from "Add Back Pressure to KafkaConsumer" to "KafkaConsumer: back pressure + improved read speed" on Oct 16, 2023
@felixschlegel (Contributor Author):

Hey folks,

I have implemented your requested changes:

  • have two poll loops, one for normal events (every pollInterval time units) and one for consumer messages (read as long as you can without sleeping)

The poll loop for consumer messages follows a similar approach to what was proposed in #128.
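Roughly, that structure looks like the following sketch (the closures stand in for RDKafkaClient.eventPoll() and the message poll/yield step; the real implementation additionally drives this through the consumer's state machines and handles shutdown, which is omitted here):

func runPollLoops(
    pollEvents: @escaping @Sendable () -> Void,                  // stands in for client.eventPoll()
    pollAndYieldMessages: @escaping @Sendable () throws -> Bool, // true if at least one message was read
    pollInterval: Duration
) async throws {
    try await withThrowingTaskGroup(of: Void.self) { group in
        // Loop 1: serve librdkafka events at a fixed cadence.
        group.addTask {
            while !Task.isCancelled {
                pollEvents()
                try await Task.sleep(for: pollInterval)
            }
        }
        // Loop 2: read consumer messages as fast as backpressure allows,
        // sleeping only when the local queue is empty.
        group.addTask {
            while !Task.isCancelled {
                let readSomething = try pollAndYieldMessages()
                if !readSomething {
                    try await Task.sleep(for: .milliseconds(10))
                }
            }
        }
        try await group.next()
        group.cancelAll()
    }
}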

Also, I want to add a benchmark test evaluating that we achieve good memory usage (back pressure) and good read speeds with this implementation, so it would be good if we can get #140 over the line 😄

Please let me know what you think!

Best,
Felix

Comment on lines +365 to +366
try await group.next()
try await group.next()
Contributor:

We probably want to do a group.cancelAll() after the first one returns without throwing.

Contributor Author:

I'm not sure; consider the following case:

When the client application stops reading the KafkaConsumerMessages sequence, messageRunLoop() will return. However, eventRunLoop() might still be processing a consumer close, so I don't see a benefit in cancelling here, but I'm open to discussion.

case .stopProducing:
    self.stateMachine.withLockedValue { $0.stopProducing() }
case .dropped:
    break
Contributor:

Here we probably want to return, right?

// Poll for new consumer message.
var result: Result<KafkaConsumerMessage, Error>?
do {
    if let message = try client.consumerPoll() {
Contributor:

If I read this correctly, we are polling a single message and then yielding a single message, right? It would probably be better if we read them in batches of 100 and then yielded all of them at once. That would mean we have to acquire the locks a lot less.

@@ -23,8 +23,6 @@ public enum KafkaProducerEvent: Sendable, Hashable {
switch event {
case .deliveryReport(results: let results):
Collaborator:

Hm, I guess it is out of scope here, but should we have backpressure for delivery reports as well?

Contributor:

I don't think we can, since they happen on the eventsQueue, unless there is a way to separate events onto different queues.

Contributor Author:

Yes, the problem is that we should call poll() at regular intervals to serve any queued callbacks/events. As @FranzBusch points out, we would need librdkafka to have a separate "delivery reports queue" for backpressure to make sense here, since some events, such as log events, should still be served even when our events AsyncSequence with all the .deliveryReport events is suspended.

Collaborator:

Ah, yes, sure... We can only separate the log queue from the others...
So we can only implement some "unfair" backpressure, i.e. sometimes poll even after receiving stopProducing, which is a dirty solution...

Contributor Author:

Yes, we could separate the log queue, but then there are also other events like commit confirmations.

Collaborator:

Yeah, I got it. We may close this discussion and think about it in a separate issue, as it is out of scope for this PR anyway and might not be a problem (at least so far).

break
}

self.stateMachine.withLockedValue { $0.newMessagesProduced() }
Collaborator:

I guess it might make sense to continue without sleeping after a message was produced, until we run out of messages again or bump into backpressure.

case .pollForEvents(let client):
    // Event poll to serve any events queued inside of `librdkafka`.
    _ = client.eventPoll()
    try await Task.sleep(for: self.configuration.pollInterval)
Collaborator:

The sleeping here is obviously out of scope and should likely be addressed separately.
What do you think about symmetry with the consumer polls, i.e. sleeping only when we are out of events and otherwise continuing to poll for events?

Contributor Author:

Yes, let's do this in a follow-up PR! This probably relates to issue #128.

}

// Reached the end of the topic+partition queue on the broker
if messagePointer.pointee.err == RD_KAFKA_RESP_ERR__PARTITION_EOF {
Collaborator:

Hm, I wonder whether all errors will now come here or whether some of them will be received in eventPoll().

Contributor:

I think there is going to be a clear split now between message-related errors and run-loop errors.

case .produceMore:
    break
case .stopProducing:
    self.stateMachine.withLockedValue { $0.stopProducing() }
Collaborator:

Contributor Author:

Yes indeed, very good catch!

Modifications:

* `KafkaConsumer`:
    * end consumer message poll loop when async sequence drops message
    * do not sleep if we picked up reading new messages again after we
      finished reading a partition
    * `messageRunLoop`:
        * fix `fatalError` where `newMessagesProduced()` is invoked after `stopProducing()`
    * add func `batchConsumerPoll` that reads a batch of messages to
      avoid acquiring the lock in `messageRunLoop` too often
@felixschlegel (Contributor Author):

We are failing on 5.10 because OpaquePointer is not Sendable. Is that a deliberate change in Swift 5.10? The documentation still states that it should be Sendable.

/// - Parameters:
///   - client: Client used for handling the connection to the Kafka cluster.
///   - maxMessages: Maximum amount of consumer messages to read in this invocation.
private func batchConsumerPoll(
blindspotbounty (Collaborator), Oct 24, 2023:

Sorry for being picky, and it is probably a matter of fine-tuning, but there is a native method in librdkafka that allows fetching a batch within one poll:

RD_EXPORT
ssize_t rd_kafka_consume_batch(rd_kafka_topic_t *rkt,
                               int32_t partition,
                               int timeout_ms,
                               rd_kafka_message_t **rkmessages,
                               size_t rkmessages_size);

UPD: plus it could be too low level...

Contributor:

Let's track that in a follow-up issue.

/// - maxMessages: Maximum amount of consumer messages to read in this invocation.
private func batchConsumerPoll(
    client: RDKafkaClient,
    maxMessages: Int = 100
Collaborator:

We should probably derive this argument in poll from the backpressure configuration. Currently we have a default high watermark of 50 messages, so we may exceed the backpressure limit by a factor of two if we read 100 messages and enqueue them into the stream.
Not sure what is right here, but maybe we can call it with one of the high or low watermark limits?

Contributor:

Yeah, we should probably set this to the difference high - low.
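A minimal illustration of that suggestion, using this PR's default watermarks:

// Derive the batch size from the watermarks so a single batch cannot push the
// buffered message count from the low watermark past the high watermark.
let lowWatermark = 10
let highWatermark = 50
let maxBatchSize = highWatermark - lowWatermark // 40 messages per batchConsumerPoll call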

FranzBusch (Contributor) left a comment:

Overall, this looks good to me now, and I think we should go ahead and merge it and then see if we encounter any problems with it.

Successfully merging this pull request may close these issues.

Enable back pressure for KafkaConsumer