
Service Overview


The most flexible way to start kafka-monitor is to run kafka-monitor-start.sh with a config file, which allows you to instantiate multiple Services or Apps that are already implemented in Kafka Monitor and tune their configs to monitor your clusters. Each service has its own thread or scheduler to carry out pre-defined tasks, e.g. producing or consuming messages.
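For example, assuming the default layout of the Kafka Monitor repository, the service can be started with the sample config file shipped under the config directory:

```
./bin/kafka-monitor-start.sh config/kafka-monitor.properties
```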

In this section we introduce some Service classes that have been implemented in Kafka Monitor. See Service Configuration for their configs.

ProduceService

ProduceService produces messages at a regular interval to each partition of a given topic. The topic should be created in advance, and its partition count should ideally be a multiple of the broker count in the cluster. By producing messages at a regular interval, ProduceService can measure the availability of the produce service provided by Kafka as a fraction. Furthermore, each message payload contains an incrementing integer and a timestamp, which the ConsumeService uses to measure, e.g., end-to-end latency and message loss.
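The exact payload layout is internal to Kafka Monitor, but conceptually each record carries a per-partition sequence number and a creation timestamp, roughly along these lines (field names here are illustrative, not the actual format):

```java
// Illustrative only -- the real payload layout used by ProduceService may differ.
// Each record carries a per-partition sequence number and the creation time in ms,
// which ConsumeService later uses to detect loss/duplication and compute latency.
static String buildMonitorPayload(long index, long createTimeMs) {
  return String.format("{\"index\":%d,\"timestamp\":%d}", index, createTimeMs);
}
```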

By default, ProduceService uses the new producer to produce messages. Users can run ProduceService with their own producer by writing a thin wrapper around their existing producer implementation that implements the interface com.linkedin.kmf.producer.KMBaseProducer.
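Such a wrapper around the vanilla Java producer might look like the sketch below. The method signatures on KMBaseProducer and BaseProducerRecord (send, close, topic, partition, key, value) are assumptions based on the description above; check the interface in the source tree for the exact definitions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import com.linkedin.kmf.producer.BaseProducerRecord;
import com.linkedin.kmf.producer.KMBaseProducer;

// Sketch of a thin wrapper; KMBaseProducer/BaseProducerRecord method names are assumed.
public class CustomProducer implements KMBaseProducer {
  private final KafkaProducer<String, String> _producer;

  public CustomProducer(Properties producerProps) {
    // producerProps must include bootstrap.servers and key/value serializers.
    _producer = new KafkaProducer<>(producerProps);
  }

  @Override
  public void send(BaseProducerRecord record, boolean sync) throws Exception {
    ProducerRecord<String, String> kafkaRecord =
        new ProducerRecord<>(record.topic(), record.partition(), record.key(), record.value());
    if (sync) {
      _producer.send(kafkaRecord).get();  // block until the broker acknowledges the send
    } else {
      _producer.send(kafkaRecord);
    }
  }

  @Override
  public void close() {
    _producer.close();
  }
}
```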

To measure the availability of the produce service, ProduceService tracks a per-partition produce_rate and error_rate. error_rate is incremented whenever an exception is thrown and caught while the producer sends messages. Availability is measured as the average of the per-partition availabilities, where per-partition availability is produce_rate / (produce_rate + error_rate) if produce_rate > 0, and 0 otherwise, since no message was produced in the time window used to measure the rates. By default this time window is 60 seconds.
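In code, the computation described above amounts to the following worked sketch (not kafka-monitor's own implementation; the rates would come from the metrics measured over the 60-second window):

```java
// Per-partition availability: produce_rate / (produce_rate + error_rate),
// or 0 if nothing was produced in the measurement window.
static double partitionAvailability(double produceRate, double errorRate) {
  return produceRate > 0 ? produceRate / (produceRate + errorRate) : 0.0;
}

// Overall availability is the average of the per-partition values.
static double availability(double[] produceRates, double[] errorRates) {
  double sum = 0.0;
  for (int i = 0; i < produceRates.length; i++) {
    sum += partitionAvailability(produceRates[i], errorRates[i]);
  }
  return sum / produceRates.length;
}
```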

ConsumeService

ConsumeService consumes messages from a topic; the messages should have been produced by ProduceService. Using the incrementing integer and timestamp carried in each message payload, ConsumeService measures the message loss rate, message duplication rate, end-to-end latency, etc.

ConsumeService has built-in support for the old consumer and the new consumer, and users can run it with their choice of consumer and configuration. Users can also run ConsumeService with their own consumer by writing a thin wrapper around their existing consumer implementation that implements the interface com.linkedin.kmf.consumer.KMBaseConsumer.
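A custom consumer wrapper could follow the same pattern as the producer wrapper above. The method names on KMBaseConsumer and the constructor of BaseConsumerRecord used below are assumptions for illustration; consult the interfaces in the source tree for the exact signatures.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import com.linkedin.kmf.consumer.BaseConsumerRecord;
import com.linkedin.kmf.consumer.KMBaseConsumer;

// Sketch of a thin wrapper; KMBaseConsumer/BaseConsumerRecord signatures are assumed.
public class CustomConsumer implements KMBaseConsumer {
  private final KafkaConsumer<String, String> _consumer;
  private Iterator<ConsumerRecord<String, String>> _records = Collections.emptyIterator();

  public CustomConsumer(String topic, Properties consumerProps) {
    // consumerProps must include bootstrap.servers, group.id and key/value deserializers.
    _consumer = new KafkaConsumer<>(consumerProps);
    _consumer.subscribe(Collections.singleton(topic));
  }

  @Override
  public BaseConsumerRecord receive() {
    // Poll for the next batch when the previous one is exhausted.
    while (!_records.hasNext()) {
      _records = _consumer.poll(1000).iterator();
    }
    ConsumerRecord<String, String> record = _records.next();
    return new BaseConsumerRecord(record.topic(), record.partition(), record.offset(),
        record.key(), record.value());
  }

  @Override
  public void close() {
    _consumer.close();
  }
}
```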

To measure the message loss rate and message duplication rate, ProduceService embeds an integer index in each message payload. This index is incremented by 1 for every successful send to a partition. ConsumeService reads the index from the payload and compares it with the last index observed for the same partition to determine whether messages were lost or duplicated.

To measure end-to-end latency, the message payload contains the timestamp taken when the message was constructed. ConsumeService parses this timestamp from the message and computes the end-to-end latency by subtracting it from the time the message was received.
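A simplified version of this per-partition bookkeeping could look like the sketch below (illustrative only, matching the hypothetical payload fields from the producer-side sketch earlier, not kafka-monitor's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative bookkeeping for one topic: tracks the last index seen per partition
// and classifies each incoming record as in-order, lost-in-between, or duplicate.
public class MonitorConsumerStats {
  private final Map<Integer, Long> _lastIndexPerPartition = new HashMap<>();
  private long _lostCount = 0;
  private long _duplicateCount = 0;

  public void record(int partition, long index, long createTimeMs, long receiveTimeMs) {
    long latencyMs = receiveTimeMs - createTimeMs;  // end-to-end latency of this record
    Long lastIndex = _lastIndexPerPartition.get(partition);
    if (lastIndex != null && index <= lastIndex) {
      _duplicateCount++;                        // index already observed: duplicate
    } else {
      if (lastIndex != null && index > lastIndex + 1) {
        _lostCount += index - lastIndex - 1;    // skipped indices count as lost messages
      }
      _lastIndexPerPartition.put(partition, index);
    }
    // latencyMs would be reported to a latency metric/histogram here.
  }
}
```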

TopicManagementService

TopicManagementService manages the monitor topic of a cluster to ensure that every broker is leader of at least one partition of the monitor topic, so that the availability metric drops below 1 if any broker fails. To achieve this, TopicManagementService monitors the number of brokers, the number of partitions, the partition assignment across brokers, and the leader distribution across brokers. It may expand partitions, reassign partitions, or trigger preferred leader election as needed.
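For example, the invariant that every broker leads at least one partition of the monitor topic can be checked along these lines (an illustrative sketch using Kafka's AdminClient, not TopicManagementService's own code):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartitionInfo;

// Returns true if every live broker currently leads at least one partition of the
// monitor topic -- the condition TopicManagementService maintains.
static boolean everyBrokerLeadsAPartition(AdminClient admin, String monitorTopic) throws Exception {
  Set<Integer> brokerIds = new HashSet<>();
  for (Node node : admin.describeCluster().nodes().get()) {
    brokerIds.add(node.id());
  }

  TopicDescription topic =
      admin.describeTopics(Collections.singleton(monitorTopic)).all().get().get(monitorTopic);
  Set<Integer> leaderIds = new HashSet<>();
  for (TopicPartitionInfo partition : topic.partitions()) {
    if (partition.leader() != null) {
      leaderIds.add(partition.leader().id());
    }
  }
  return leaderIds.containsAll(brokerIds);
}
```

If such a check fails, the service can expand the partition count, reassign partitions, or trigger preferred leader election, as described above.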
