Commit 86da615

monitoring: Add documentation about throughput health check
ref: fluent/fluent-bit#5773 Signed-off-by: Thiago Padilha <[email protected]>
1 parent 641e169 commit 86da615

1 file changed

+28
-8
lines changed

administration/monitoring.md

Lines changed: 28 additions & 8 deletions
@@ -218,14 +218,19 @@ Sample alerts are available [here](https://github.com/fluent/fluent-bit-docs/tre
218218

219219
## Health Check for Fluent Bit
220220

221-
Fluent bit now supports four new configs to set up the health check.
222-
223-
| Config Name | Description | Default Value |
224-
| ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| ------------- |
225-
| Health_Check | enable Health check feature | Off |
226-
| HC_Errors_Count | the error count to meet the unhealthy requirement, this is a sum for all output plugins in a defined HC_Period, example for output error: ` [2022/02/16 10:44:10] [ warn] [engine] failed to flush chunk '1-1645008245.491540684.flb', retry in 7 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.3 (out_id=3)` | 5 |
227-
| HC_Retry_Failure_Count | the retry failure count to meet the unhealthy requirement, this is a sum for all output plugins in a defined HC_Period, example for retry failure: `[2022/02/16 20:11:36] [ warn] [engine] chunk '1-1645042288.260516436.flb' cannot be retried: task_id=0, input=tcp.3 > output=cloudwatch_logs.1 ` | 5 |
228-
| HC_Period | The time period by second to count the error and retry failure data point | 60 |
221+
Fluent Bit supports nine configuration options for setting up the health check.
222+
223+
| Config Name | Description | Default Value |
224+
| ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| ------------- |
225+
| Health_Check | enable Health check feature | Off |
226+
| HC_Errors_Count | the error count to meet the unhealthy requirement, this is a sum for all output plugins in a defined HC_Period, example for output error: ` [2022/02/16 10:44:10] [ warn] [engine] failed to flush chunk '1-1645008245.491540684.flb', retry in 7 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.3 (out_id=3)` | 5 |
227+
| HC_Retry_Failure_Count | the retry failure count to meet the unhealthy requirement, this is a sum for all output plugins in a defined HC_Period, example for retry failure: `[2022/02/16 20:11:36] [ warn] [engine] chunk '1-1645042288.260516436.flb' cannot be retried: task_id=0, input=tcp.3 > output=cloudwatch_logs.1 ` | 5 |
228+
| HC_Period | The time period by second to count the error and retry failure data point | 60 |
229+
| HC_Throughput                  | Enable the throughput health check (more details below). In this context, throughput means the `OUTPUT_RATE/INPUT_RATE` ratio, and the check runs according to `HC_Period`. If this is "On", all other throughput options must also be set, since they have no default values. | Off           |
230+
| HC_Throughput_Input_Plugins | Comma separated list of input plugins used for the purposes of calculating input rate. | - |
231+
| HC_Throughput_Output_Plugins | Comma separated list of output plugins used for the purposes of calculating output rate. | - |
232+
| HC_Throughput_Ratio_Threshold  | Threshold for the `OUTPUT_RATE/INPUT_RATE` ratio. If the ratio is below this value, the current check fails. A single failed check is not enough to trigger a health error; see `HC_Throughput_Min_Failures` below for details. | -             |
233+
| HC_Throughput_Min_Failures     | Minimum number of consecutive ratio check failures required before the health endpoint returns an error. For example, if this is 60 and `HC_Period` is left at its default, the ratio must be below the threshold for 1 minute before an error is returned. | -             |
229234

230235
*Note: Not every error log line is counted as an error. The error and retry failure counts only include the specific messages shown as examples in the config table descriptions.*
231236

@@ -277,6 +282,21 @@ If (HC_Errors_Count > 5) OR (HC_Retry_Failure_Count > 5) IN 5 seconds is TRUE, t
277282
If (HC_Errors_Count > 5) OR (HC_Retry_Failure_Count > 5) IN 5 seconds is FALSE, then it's healthy.
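
The rule above can be sketched in a few lines of Python. This is only an illustration of the documented condition, not Fluent Bit's implementation; the function name `is_unhealthy` and its parameters are hypothetical, and the counts are assumed to be the totals collected across all output plugins within one `HC_Period`.

```python
def is_unhealthy(errors: int, retry_failures: int,
                 errors_limit: int = 5, retry_failures_limit: int = 5) -> bool:
    """Return True when either count exceeds its limit within one period.

    Hypothetical helper: the limits stand in for HC_Errors_Count and
    HC_Retry_Failure_Count, and the arguments are the counts observed
    during a single HC_Period.
    """
    return errors > errors_limit or retry_failures > retry_failures_limit


# Six flush errors in the period exceed the limit of five.
too_many_errors = is_unhealthy(errors=6, retry_failures=0)
# Exactly five of each does not exceed either limit.
at_limit = is_unhealthy(errors=5, retry_failures=5)
```

Note that the comparison is strict, matching the `>` in the condition above: a count equal to the limit is still healthy.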
278283

279284

285+
### Throughput health check
286+
287+
If `HC_Throughput` and the related options are set, Fluent Bit monitors the output/input ratio, and the health endpoint returns an error if the ratio is below the configured threshold. For example:
288+
289+
```
290+
hc_throughput On
291+
hc_throughput_input_plugins tail.0
292+
hc_throughput_output_plugins http.0
293+
hc_throughput_ratio_threshold 0.1
294+
hc_throughput_min_failures 60
295+
```
296+
297+
In the above example, if the http output rate is below 1/10 of the tail input rate for 1 consecutive minute, the `/api/v1/health` endpoint returns `error`. Once the ratio goes back above the threshold, the `OK` status is restored until another full minute of consecutive failed checks has passed.
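
The consecutive-failure behaviour described above can be modelled with a small Python sketch. It is an illustration under stated assumptions, not the code Fluent Bit runs: the class `ThroughputCheck` and its method `observe` are hypothetical names, the rates are assumed to be supplied by the caller once per check, and treating a zero input rate as a passing check is an assumption made only to avoid dividing by zero.

```python
from dataclasses import dataclass


@dataclass
class ThroughputCheck:
    """Hypothetical model of the throughput health check."""
    ratio_threshold: float        # stands in for hc_throughput_ratio_threshold
    min_failures: int             # stands in for hc_throughput_min_failures
    consecutive_failures: int = 0

    def observe(self, input_rate: float, output_rate: float) -> str:
        """Record one periodic check and return the resulting health status."""
        if input_rate <= 0:
            # Assumption: with no input, the check is treated as passing.
            passed = True
        else:
            # A check fails when the output/input ratio is below the threshold.
            passed = (output_rate / input_rate) >= self.ratio_threshold

        if passed:
            # Any passing check resets the streak, restoring the OK status.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

        if self.consecutive_failures >= self.min_failures:
            return "error"
        return "ok"


# Three checks at ratio 0.05 with a threshold of 0.1 and min_failures of 3.
check = ThroughputCheck(ratio_threshold=0.1, min_failures=3)
statuses = [check.observe(input_rate=100.0, output_rate=5.0) for _ in range(3)]
# One check back above the threshold clears the failure streak.
recovered = check.observe(input_rate=100.0, output_rate=50.0)
```

In this sketch the first two failed checks still report `ok`, the third reports `error`, and a single check above the threshold returns the status to `ok`, after which a new streak of `min_failures` failed checks is needed before `error` is reported again.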
298+
299+
280300
## Calyptia Cloud
281301

282302
[Calyptia Cloud](https://cloud.calyptia.com) is a hosted service that allows you to monitor your Fluent Bit agents including data flow, metrics and configurations.
