@cosmo0920
Contributor

@cosmo0920 cosmo0920 commented Oct 21, 2025

To avoid skipping long lines entirely,
in_tail sometimes needs to consume them up to the buffer limit. This provides an alternative mitigation for handling long lines.

Fixes #10435.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features

    • Optional truncation of excessively long log lines with UTF‑8‑safe trimming and a new metric counting truncated occurrences.
    • New configuration flag to enable/disable truncation.
  • Bug Fixes

    • Improved resource cleanup on initialization/validation failures.
    • Added mutual‑exclusion check for conflicting encoding options to prevent misconfiguration.
  • Tests

    • New tests verifying truncation behavior for long ASCII and UTF‑8 lines.

@coderabbitai

coderabbitai bot commented Oct 21, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a configurable truncate_long_lines option to in_tail; implements UTF‑8‑safe truncation when enabled, registers a new truncation metric, improves config cleanup on init errors, updates per-chunk byte accounting, and adds tests for ASCII and UTF‑8 truncation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration binding: plugins/in_tail/tail.c | Adds boolean config option truncate_long_lines (default false) and binds it to flb_tail_config via offsetof. |
| Metrics & struct: plugins/in_tail/tail_config.h | Adds #define FLB_TAIL_METRIC_L_TRUNCATED 104, int truncate_long_lines; and struct cmt_counter *cmt_long_line_truncated; to flb_tail_config. |
| Initialization & cleanup: plugins/in_tail/tail_config.c | Replaces flb_free(ctx) error-paths with flb_tail_config_destroy(ctx); adds unicode/generic encoding exclusivity check; initializes/registers the truncation metric counter. |
| Truncation logic: plugins/in_tail/tail_file.c | Adds static utf8_safe_truncate_pos() and integrates UTF‑8‑safe truncation flow when truncate_long_lines is enabled: computes cut position, packs/truncates line, updates offsets/bytes, sets skip flags, increments truncation metric, and ensures cleanup via a truncation_end path. Adjusts decoded-length tracking. |
| Tests: tests/runtime/in_tail.c | Adds helpers to write long ASCII and UTF‑8 lines and two tests (truncate_long_lines, truncate_long_lines_utf8) that enable truncate_long_lines and assert expected outputs. |

Sequence Diagram(s)

sequenceDiagram
    participant Reader as File Reader
    participant Processor as Line Processor (in_tail)
    participant Truncator as UTF-8 Truncator
    participant Metrics as Metrics Registry
    participant Output as Output Queue

    Note right of Reader: read chunk from file
    Reader->>Processor: supply chunk bytes
    Processor->>Processor: decode/convert chunk (ret = decoded length)
    Processor->>Processor: search newline within eff_max window

    alt newline found or within limits
        Processor->>Output: emit complete line
    else truncate_long_lines enabled and dec_len >= eff_max
        Processor->>Truncator: utf8_safe_truncate_pos(buf, ret, eff_max)
        Truncator->>Processor: return cut position
        Processor->>Output: emit truncated segment
        Processor->>Metrics: increment long_line_truncated counter
        Processor->>Processor: set skip_next, adjust bytes/offsets
    else truncate_long_lines disabled
        Processor->>Processor: skip/drop until newline (existing behavior)
    end

    Processor->>Reader: return processed byte count / update state

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

backport to v4.0.x

Suggested reviewers

  • edsiper
  • koleini
  • fujimotos

🐰 I nibble bytes that wander long,
I trim with care where lines go wrong.
UTF‑8 safe cuts, metrics in sight,
I hop, I count, then send them right. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Out of Scope Changes Check | ⚠️ Warning | The pull request includes one potentially out-of-scope change: in tail_config.c, a new mutual exclusivity check is introduced that validates unicode.encoding and generic.encoding options cannot both be specified when FLB_HAVE_UNICODE_ENCODER is defined. This change appears unrelated to the long line truncation feature described in issue #10435 and is not mentioned in the PR objectives. While other changes like error-path deallocations and metric initialization can be justified as supporting infrastructure for the truncation feature, this encoding validation check stands apart from the truncation implementation. | The unicode.encoding and generic.encoding mutual exclusivity check should either be removed from this PR or its relationship to the truncation feature should be clearly documented. If this check addresses a separate bug or safety concern, it would be more appropriate to submit it as an independent pull request to keep the scope focused and simplify review. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title "in_tail: Implement long line truncation" is clear, concise, and directly summarizes the main change in the pull request. It accurately reflects the primary feature being introduced: a new long line truncation capability in the in_tail plugin. The title uses appropriate naming conventions and is specific enough for someone reviewing the git history to understand the core objective without ambiguity. |
| Linked Issues Check | ✅ Passed | The pull request successfully implements the primary objective from issue #10435, which requests truncation of long log lines to prevent file monitoring from stopping when lines exceed buffer limits. The implementation includes a new truncate_long_lines configuration option, UTF-8 safe truncation logic in tail_file.c, metrics tracking for truncated lines, and test coverage demonstrating truncation behavior for both ASCII and UTF-8 encoded lines. While the feature defaults to OFF (opt-in rather than always-on), this still fulfills the core requirement to truncate lines and consume them up to buffer boundaries instead of skipping them entirely. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee50311 and 903bcbf.

📒 Files selected for processing (1)
  • plugins/in_tail/tail_file.c (8 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: cosmo0920
PR: fluent/fluent-bit#11059
File: plugins/in_tail/tail_file.c:1618-1640
Timestamp: 2025-10-23T07:43:16.197Z
Learning: In plugins/in_tail/tail_file.c, when truncate_long_lines is enabled and the buffer is full, the early truncation path uses `lines > 0` as the validation pattern to confirm whether process_content successfully processed content. This is intentional to track occurrences of line processing rather than byte consumption, and consuming bytes based on `processed_bytes > 0` would be overkill for this validation purpose.

Applied to files:

  • plugins/in_tail/tail_file.c
🧬 Code graph analysis (1)
plugins/in_tail/tail_file.c (5)
plugins/in_tail/tail_multiline.c (1)
  • flb_tail_mult_flush (486-534)
lib/cmetrics/src/cmt_counter.c (1)
  • cmt_counter_inc (100-117)
src/flb_input.c (1)
  • flb_input_name (790-797)
src/flb_metrics.c (1)
  • flb_metrics_sum (180-191)
plugins/in_tail/tail_db.c (1)
  • flb_tail_db_file_offset (290-321)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: pr-compile-without-cxx (3.31.6)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: pr-compile-centos-7
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: PR - fuzzing test
🔇 Additional comments (4)
plugins/in_tail/tail_file.c (4)

460-476: LGTM! UTF-8 truncation is now correct.

The function properly checks s[cut] (first omitted byte) to detect continuation bytes and backtracks to avoid splitting multi-byte codepoints. This correctly addresses the issue from the previous review.
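
The backtracking rule described above can be illustrated with a minimal sketch. This is not the PR's utf8_safe_truncate_pos(); the name utf8_cut_sketch and its exact signature are assumptions made for illustration, and the sketch only inspects bit patterns, so it does not validate malformed input.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the PR's code): return the largest cut <= max
 * such that s[0..cut) does not end in the middle of a UTF-8 sequence.
 */
size_t utf8_cut_sketch(const char *s, size_t len, size_t max)
{
    size_t cut;

    assert(s != NULL);

    /* Nothing to trim when the whole buffer already fits. */
    if (max >= len) {
        return len;
    }

    cut = max;

    /*
     * s[cut] is the first omitted byte. While it is a continuation byte
     * (10xxxxxx), the cut sits inside a multi-byte sequence, so step back
     * until s[cut] is a lead byte or an ASCII byte.
     */
    while (cut > 0 && (((unsigned char) s[cut]) & 0xC0) == 0x80) {
        cut--;
    }

    return cut;
}
```

With the six-byte string "あい" (two three-byte codepoints), a limit of 4 would land inside the second codepoint, so the sketch moves the cut back to byte 3 and keeps only the first character.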


571-621: LGTM! Truncation logic is well-structured.

The implementation correctly:

  • Uses buf_max_size as the truncation threshold (line 575), matching the fix from previous reviews
  • Performs UTF-8 safe truncation when no newline is found within the window
  • Handles the skip_next continuation flow (lines 588-591) to consume remaining long-line content until a newline is encountered
  • Flushes multiline state before truncating to avoid incomplete multiline records
  • Increments the truncation metric appropriately

1616-1638: LGTM! Early truncation path uses correct validation pattern.

The code appropriately attempts truncation when the buffer is full and consumes bytes when lines > 0. As noted in learnings, this validation pattern intentionally tracks occurrences of line processing rather than byte consumption. The fallback to the legacy skip path when lines == 0 is reasonable for cases where truncation doesn't apply (e.g., when decoded data is still below the threshold).

Based on learnings


781-799: LGTM! Cleanup and byte accounting are correct.

The truncation_end label provides a clean exit path for both truncation scenarios (actual truncation and skip_next continuation). The conditional byte accounting (lines 791-799) correctly uses bytes_override when truncation occurs—this consumes the entire buffer after packing the truncated portion, which is necessary since skip_next is set to skip the remainder of the long line.



To avoid skipping long lines entirely,
in_tail sometimes needs to consume them up to the buffer limit.
This provides an alternative mitigation for handling long lines.

Signed-off-by: Hiroshi Hatake <[email protected]>
@cosmo0920 cosmo0920 added this to the Fluent Bit v4.2 milestone Oct 22, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
plugins/in_tail/tail_config.c (2)

491-503: Metric help string typo.

“occurences” → “occurrences”. Keeps consistency with other counters.

-                               "Total number of truncated occurences for long lines",
+                               "Total number of truncated occurrences for long lines",

138-149: Initialize pipe channel arrays to -1 immediately after calloc to prevent closing stdin on early failure.

The risk is real: if flb_pipe_create(ctx->ch_pending) fails at line 136, then ch_pending[0] and ch_pending[1] remain zero-initialized. The subsequent flb_tail_config_destroy() call at line 139 unconditionally invokes flb_pipe_close() on all four channels. The Linux guard checks if (fd == -1) but does not protect against fd=0; the Windows version lacks any guard. Either way, close(0) executes, closing stdin.

Apply the suggested initialization pattern immediately after flb_calloc():

ctx = flb_calloc(1, sizeof(struct flb_tail_config));
if (!ctx) {
    flb_errno();
    return NULL;
}
+/* Initialize pipe fds to -1 to safely guard early destroy() calls */
+ctx->ch_manager[0] = ctx->ch_manager[1] = -1;
+ctx->ch_pending[0] = ctx->ch_pending[1] = -1;
ctx->config = config;
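
The reasoning behind the -1 initialization can be shown in isolation. The following is a minimal sketch under stated assumptions: struct tail_ctx_sketch and the three helper functions are hypothetical stand-ins, not Fluent Bit code, and plain close(2) is used in place of flb_pipe_close().

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for the pipe fields of struct flb_tail_config. */
struct tail_ctx_sketch {
    int ch_manager[2];
    int ch_pending[2];
};

/* Mark every descriptor as "not opened" right after allocation. */
void tail_ctx_sketch_init(struct tail_ctx_sketch *ctx)
{
    assert(ctx != NULL);
    ctx->ch_manager[0] = ctx->ch_manager[1] = -1;
    ctx->ch_pending[0] = ctx->ch_pending[1] = -1;
}

/* Close fd only if it was opened; returns 1 when a close happened. */
int close_if_open_sketch(int *fd)
{
    if (*fd < 0) {
        return 0;
    }
    close(*fd);
    *fd = -1;
    return 1;
}

/* Safe on a partially initialized context; returns the closed count. */
int tail_ctx_sketch_destroy(struct tail_ctx_sketch *ctx)
{
    int closed = 0;

    assert(ctx != NULL);
    closed += close_if_open_sketch(&ctx->ch_manager[0]);
    closed += close_if_open_sketch(&ctx->ch_manager[1]);
    closed += close_if_open_sketch(&ctx->ch_pending[0]);
    closed += close_if_open_sketch(&ctx->ch_pending[1]);
    return closed;
}
```

Because a zero-filled struct would hold descriptor 0 in every slot, calling the destroy function before init would close stdin; with the -1 sentinel, an early destroy closes nothing.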
🧹 Nitpick comments (5)
plugins/in_tail/tail.c (1)

724-727: Config option addition LGTM; clarify precedence with skip_long_lines.

When both truncate_long_lines=on and skip_long_lines=on, truncation currently wins. Consider documenting that or warning if both enabled.

Confirm intended precedence in docs.

tests/runtime/in_tail.c (1)

994-1024: Harden writes against partial writes.

write(2) on files can return short counts; loop until full to avoid flaky tests.

 static int write_long_ascii_line(int fd, size_t total_bytes)
 {
-    const char *chunk = "0123456789abcdef0123456789abcdef"; /* 32 bytes */
+    const char *chunk = "0123456789abcdef0123456789abcdef"; /* 32 bytes */
     size_t chunk_len = strlen(chunk);
     size_t written = 0;
-    ssize_t ret;
+    ssize_t ret;
     size_t rest = 0;
 
     while (written + chunk_len <= total_bytes) {
-        ret = write(fd, chunk, chunk_len);
-        if (ret < 0) {
+        size_t off = 0;
+        while (off < chunk_len) {
+            ret = write(fd, chunk + off, chunk_len - off);
+            if (ret <= 0) { flb_errno(); return -1; }
+            off += (size_t) ret;
+        }
-        }
-        written += (size_t) ret;
+        written += chunk_len;
     }
     if (written < total_bytes) {
         rest = total_bytes - written;
-        ret = write(fd, chunk, rest);
-        if (ret < 0) {
+        size_t off = 0;
+        while (off < rest) {
+            ret = write(fd, chunk + off, rest - off);
+            if (ret <= 0) { flb_errno(); return -1; }
+            off += (size_t) ret;
+        }
-        }
-        written += (size_t) ret;
+        written += rest;
     }
@@
 static int write_long_utf8_line(int fd, size_t total_bytes)
 {
     const char *u8_aa = "あ";
     size_t u8_len = strlen(u8_aa); /* 3 */
     size_t written = 0;
-    ssize_t ret;
+    ssize_t ret;
     const char *ascii = "XYZ";
     size_t rest = 0;
 
     while (written + u8_len <= total_bytes) {
-        ret = write(fd, u8_aa, u8_len);
-        if (ret < 0) {
+        size_t off = 0;
+        while (off < u8_len) {
+            ret = write(fd, u8_aa + off, u8_len - off);
+            if (ret <= 0) { flb_errno(); return -1; }
+            off += (size_t) ret;
+        }
-        }
-        written += (size_t) ret;
+        written += u8_len;
     }
 
     if (written < total_bytes) {
         rest = total_bytes - written;
-        if (rest > strlen(ascii)) {
+        if (rest > strlen(ascii)) {
             rest = strlen(ascii);
         }
-        ret = write(fd, ascii, rest);
-        if (ret < 0) {
+        size_t off = 0;
+        while (off < rest) {
+            ret = write(fd, ascii + off, rest - off);
+            if (ret <= 0) { flb_errno(); return -1; }
+            off += (size_t) ret;
+        }
-        }
-        written += (size_t) ret;
+        written += rest;
     }

Also applies to: 1026-1061
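
The repeated inner loops in the suggestion could be factored into one helper. A minimal sketch, assuming a hypothetical write_all_sketch() that is not part of the test file; it additionally retries on EINTR, which the suggested diff does not do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical test helper: write exactly len bytes to fd.
 * Returns 0 on success and -1 on error or when no progress is made.
 */
int write_all_sketch(int fd, const char *buf, size_t len)
{
    size_t off = 0;
    ssize_t ret;

    assert(buf != NULL);

    while (off < len) {
        ret = write(fd, buf + off, len - off);
        if (ret < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted before writing: retry */
            }
            return -1;
        }
        if (ret == 0) {
            return -1;          /* avoid spinning when nothing is written */
        }
        off += (size_t) ret;    /* short write: continue with the rest */
    }

    return 0;
}
```

Both write_long_ascii_line() and write_long_utf8_line() could then call the helper once per chunk and add the chunk length to written on success.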

plugins/in_tail/tail_file.c (3)

460-479: UTF‑8 cut helper looks correct. Add a brief comment on invariants.

Optional: note that caller guarantees max < len before call.


593-605: Window computation is inconsistent (computed twice, second overrides first).

You first derive window from eff_max, then overwrite with buf_max_size+1. Keep one source of truth (eff_max) for clarity and to avoid future regressions.

-        /* Set the search window for memchr. Add 1 because memchr is (ptr, char, size) */
-        if (eff_max > 0) {
-            window = eff_max + 1;
-        }
-        else {
-            window = 0;
-        }
-        dec_len = (size_t)(end - data);
-        window = ctx->buf_max_size + 1;
+        /* Set the search window for memchr to the effective threshold (+1) */
+        dec_len = (size_t)(end - data);
+        window = (eff_max > 0) ? (eff_max + 1) : 0;
         if (window > dec_len) {
             window = dec_len;
         }

607-642: Truncation flow LGTM; small robustness nits.

  • Consider checking the return of flb_tail_file_pack_line and logging on failure.
  • When skip_next is already true and no newline found, you consume the whole decoded window via bytes_override; that’s intentional, but add a short comment to make it explicit.
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 65722e8 and 75de1c5.

📒 Files selected for processing (5)
  • plugins/in_tail/tail.c (1 hunks)
  • plugins/in_tail/tail_config.c (7 hunks)
  • plugins/in_tail/tail_config.h (3 hunks)
  • plugins/in_tail/tail_file.c (8 hunks)
  • tests/runtime/in_tail.c (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
plugins/in_tail/tail_file.c (5)
plugins/in_tail/tail_multiline.c (1)
  • flb_tail_mult_flush (486-534)
lib/cmetrics/src/cmt_counter.c (1)
  • cmt_counter_inc (100-117)
src/flb_input.c (1)
  • flb_input_name (790-797)
src/flb_metrics.c (1)
  • flb_metrics_sum (180-191)
plugins/in_tail/tail_db.c (1)
  • flb_tail_db_file_offset (290-321)
plugins/in_tail/tail_config.c (2)
lib/cmetrics/src/cmt_counter.c (1)
  • cmt_counter_create (26-81)
src/flb_metrics.c (1)
  • flb_metrics_add (129-178)
tests/runtime/in_tail.c (1)
src/flb_lib.c (9)
  • flb_create (138-220)
  • flb_service_set (647-673)
  • flb_input (261-271)
  • flb_input_set (300-330)
  • flb_output (274-284)
  • flb_output_set (515-546)
  • flb_start (914-925)
  • flb_stop (942-985)
  • flb_destroy (223-258)
🔇 Additional comments (4)
plugins/in_tail/tail_config.c (2)

513-515: Old metrics label addition LGTM.


238-246: Good exclusivity check for encodings.

Please confirm this block is compiled on all targets where both features can be enabled (build matrix).

plugins/in_tail/tail_file.c (2)

622-629: Metrics increment: OK; keep help/label spelling consistent.

Ensure help string spelling fixed in tail_config.c as noted.


1613-1657: Nice: early truncation path when buffer is full.

This prevents unbounded growth and keeps forward progress.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
plugins/in_tail/tail_file.c (1)

460-478: UTF-8 boundary validation may be incomplete.

The function backtracks over continuation bytes (0x80-0xBF) but doesn't verify that the resulting cut position points to a valid UTF-8 start byte. If max lands in the middle of a multi-byte sequence, backtracking ensures we don't split continuation bytes, but we might still cut at an invalid position if the data contains malformed UTF-8.

Consider adding validation that s[cut] (if cut < len) is a valid start byte (0x00-0x7F, 0xC0-0xDF, 0xE0-0xEF, or 0xF0-0xF7).

     while (cut > 0 && ((unsigned char)s[cut - 1] & 0xC0) == 0x80) {
         cut--;
     }
+    
+    /* Optionally verify we're at a valid UTF-8 start byte */
+    if (cut > 0 && cut < len) {
+        unsigned char c = (unsigned char)s[cut];
+        /* Valid start bytes: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx */
+        if ((c & 0x80) != 0 && (c & 0xC0) != 0xC0) {
+            /* Invalid start byte after backtrack, search backward for valid start */
+            while (cut > 0) {
+                cut--;
+                c = (unsigned char)s[cut];
+                if ((c & 0x80) == 0 || (c & 0xC0) == 0xC0) {
+                    break;
+                }
+            }
+        }
+    }
 
     return cut;
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 75de1c5 and 9eb7421.

📒 Files selected for processing (1)
  • plugins/in_tail/tail_file.c (8 hunks)
🔇 Additional comments (6)
plugins/in_tail/tail_file.c (6)

504-511: LGTM!

The new local variables for truncation state are well-named and properly initialized.


561-562: Good optimization: using precise decoded length.

Using the return value from the conversion function is more efficient and accurate than strlen(decoded), which would require scanning the entire buffer.


802-806: LGTM!

The truncation_end label is correctly placed to ensure cleanup of the decoded buffer while bypassing normal line processing when truncation occurs.


812-820: LGTM!

The three-way byte accounting correctly handles truncation, encoding conversion, and normal processing cases.


1613-1619: LGTM!

Variable declarations are properly initialized.


616-618: Multiline flush behavior during truncation is correct; state is properly reset.

The multiline state machine is properly reset after the flush. flb_tail_mult_pack_line_body() (called from flb_tail_mult_flush()) resets all multiline state variables at its conclusion (lines 471-474): mult_firstline, mult_skipping, mult_keys, and mult_flush_timeout. This ensures the multiline state is clean before the truncated line is packed, preventing confusion in subsequent iterations.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
plugins/in_tail/tail_file.c (2)

573-624: Simplify truncation trigger; remove redundant buf_size gate and bound search window

The condition already requires dec_len >= eff_max (buf_max_size−1). When buf_size < buf_max_size, dec_len cannot reach eff_max, so the extra check file->buf_size >= ctx->buf_max_size is redundant. Also, set the memchr window to min(eff_max+1, dec_len) to avoid scanning beyond the effective limit.

-        dec_len = (size_t)(end - data);
-        window = ctx->buf_max_size + 1;
-        if (window > dec_len) {
-            window = dec_len;
-        }
+        dec_len = (size_t)(end - data);
+        /* Search at most up to the effective threshold (+1 for '\n') */
+        window = eff_max + 1;
+        if (window > dec_len) {
+            window = dec_len;
+        }
 
         nl = memchr(data, '\n', window);
-        if (file->buf_size >= ctx->buf_max_size &&
-            nl == NULL && eff_max > 0 && dec_len >= eff_max) {
+        if (nl == NULL && eff_max > 0 && dec_len >= eff_max) {

This makes the logic consistent and avoids unnecessary coupling to current allocation size. Related to earlier feedback on the threshold logic.
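
The simplified trigger can be expressed as a small predicate. This is a sketch under assumptions: should_truncate_sketch() is a hypothetical name, and data, dec_len and eff_max mirror the variables in the diff rather than the actual process_content() code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical predicate mirroring the suggested condition.
 * Returns 1 when no newline is found within the effective window and the
 * decoded data has reached the threshold, 0 otherwise.
 */
int should_truncate_sketch(const char *data, size_t dec_len, size_t eff_max)
{
    size_t window;
    const char *nl;

    assert(data != NULL);

    if (eff_max == 0) {
        return 0;
    }

    /* Search at most eff_max + 1 bytes, never past the decoded data. */
    window = eff_max + 1;
    if (window > dec_len) {
        window = dec_len;
    }

    nl = memchr(data, '\n', window);

    return nl == NULL && dec_len >= eff_max;
}
```

The extra byte in the window lets a line of exactly eff_max bytes followed by '\n' pass through untruncated, while a newline that appears only beyond the window does not prevent truncation.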


1618-1640: Gate early processing by maxed buffer to avoid no‑op passes

Early call to process_content happens whenever capacity < 1, but truncation only triggers once we hit the effective threshold. Guard the early path with file->buf_size >= ctx->buf_max_size so we don't spin a no‑op that just falls through to growth.

-        if (ctx->truncate_long_lines == FLB_TRUE) {
+        if (ctx->truncate_long_lines == FLB_TRUE &&
+            file->buf_size >= ctx->buf_max_size) {
             lines = process_content(file, &processed_bytes);
             ...
         }

Keeps intent clear and aligns with the truncation condition.

🧹 Nitpick comments (1)
plugins/in_tail/tail_file.c (1)

793-799: Bytes accounting after truncation looks correct; minor clarity tweak optional

The bytes_override path is used only when truncation_happened; when not, original_len (if set) wins, else processed_bytes. Consider a small comment noting why original_len takes precedence (pre‑decode byte accounting). No functional change.
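
The precedence described in this comment can be written as a single function. A minimal sketch: consumed_bytes_sketch() is a hypothetical name, the parameters mirror the variables named above, and treating original_len == 0 as "not set" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: pick how many input bytes to mark as consumed.
 * Precedence: truncation override, then pre-decode length, then the
 * number of bytes processed line by line.
 */
size_t consumed_bytes_sketch(int truncation_happened, size_t bytes_override,
                             size_t original_len, size_t processed_bytes)
{
    /* Assumption: a truncation always consumes at least one byte. */
    assert(!truncation_happened || bytes_override > 0);

    if (truncation_happened) {
        return bytes_override;
    }
    if (original_len > 0) {
        /* Data was decoded: account in bytes of the original encoding. */
        return original_len;
    }
    return processed_bytes;
}
```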

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 9eb7421 and 31790d5.

📒 Files selected for processing (1)
  • plugins/in_tail/tail_file.c (8 hunks)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: pr-compile-without-cxx (3.31.6)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: PR - fuzzing test
🔇 Additional comments (1)
plugins/in_tail/tail_file.c (1)

560-562: Good: use precise decoded length from generic converter

Assigning end = data + ret avoids reliance on stale end. LGTM.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
plugins/in_tail/tail_file.c (1)

460-478: Fix UTF‑8 cut: current backtrack can return a lone lead byte

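The backtrack must inspect the first omitted byte (s[cut]), not the last included byte (s[cut-1]). With the current check, a cut that lands right after a lead byte is never moved, because s[cut-1] is the lead byte rather than a continuation byte; max=1 on a multibyte character therefore returns a lone lead byte, which is invalid UTF‑8. Apply: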

 static size_t utf8_safe_truncate_pos(const char *s, size_t len, size_t max)
 {
-    size_t cut = 0;
-
-    cut = (len <= max) ? len : max;
+    size_t cut = (len <= max) ? len : max;
     if (cut == len) {
         return cut;
     }
 
-   /* backtrack over continuation bytes 10xxxxxx
-    * NOTE: check the last INCLUDED byte => s[cut-1], not s[cut].
-    */
-    while (cut > 0 && ((unsigned char)s[cut - 1] & 0xC0) == 0x80) {
+    /* If first omitted byte is a continuation (10xxxxxx), step back */
+    while (cut > 0 && cut < len &&
+           ((unsigned char)s[cut] & 0xC0) == 0x80) {
         cut--;
     }
 
     return cut;
 }

Optionally add tests that truncate at max=1 for 2/3/4‑byte codepoints.
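
As a rough sketch of what such tests would exercise, here is a standalone version of the suggested function. It drops the redundant cut < len check (the early return already guarantees it) and assumes the input is valid UTF-8; it is an illustration, not the code merged in the PR.

```c
#include <stddef.h>

/*
 * Returns the largest cut <= max such that s[0..cut) does not end
 * inside a multibyte UTF-8 sequence. Assumes s holds valid UTF-8.
 */
static size_t utf8_safe_truncate_pos(const char *s, size_t len, size_t max)
{
    size_t cut = (len <= max) ? len : max;

    if (cut == len) {
        return cut;
    }

    /* s[cut] is the first omitted byte; if it is a continuation byte
     * (10xxxxxx), the cut falls inside a character, so step back */
    while (cut > 0 && ((unsigned char) s[cut] & 0xC0) == 0x80) {
        cut--;
    }

    return cut;
}
```

For a 2-byte character such as "\xC3\xA9" with max=1, the first omitted byte 0xA9 is a continuation byte, so the cut moves back to 0 and nothing is emitted rather than a lone lead byte. The same holds for 3-byte and 4-byte code points at max=1.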

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 31790d5 and ee50311.

📒 Files selected for processing (1)
  • plugins/in_tail/tail_file.c (8 hunks)
🔇 Additional comments (1)
plugins/in_tail/tail_file.c (1)

560-562: Good: use precise decoded length

Using ret to set end after generic encoding conversion is correct and avoids stale length math.


Development

Successfully merging this pull request may close these issues.

Default for skip_long_lines should be ON -or- Long Lines should truncate
