engine: fix high memory usage in FluentLogEventRouter #4996


Closed
Watson1978 wants to merge 1 commit from the engine branch.

Conversation

@Watson1978 (Contributor) commented Jun 5, 2025

Which issue(s) this PR fixes:
Fixes #4994

What this PR does / why we need it:
Previously, FluentLogEventRouter was designed to start processing only after the worker had started.

As a result, if the in_tail plugin emitted a large number of warnings immediately after the worker started, those warning messages would accumulate in FluentLogEventRouter's queue and consume a significant amount of memory.

This PR changes the behavior so that FluentLogEventRouter starts processing before the worker is fully started, allowing it to consume the message queue immediately.
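
Conceptually, the router buffers log events in a queue while a drain thread consumes them; the fix only changes when that drain thread is started. Below is a minimal, hypothetical sketch of the pattern (the class and method names are illustrative, not the actual Fluentd implementation):

# Hypothetical sketch of a queue-backed log event router; not the actual Fluentd code.
# Events emitted before start() are simply buffered in the queue.
class QueuedLogRouter
  def initialize
    @queue = Thread::Queue.new
  end

  # Plugins can call this at any time; events pile up until the drain thread runs.
  def emit(event)
    @queue << event
  end

  # If this runs only after the worker has started, an early burst of warnings
  # can accumulate in @queue; starting it earlier drains the queue immediately.
  def start(&handler)
    @thread = Thread.new do
      loop do
        event = @queue.pop        # blocks until an event (or the stop sentinel) arrives
        break if event == :stop   # sentinel pushed by #stop
        handler.call(event)
      end
    end
  end

  def stop
    @queue << :stop
    @thread&.join
  end
end

The earlier start is called relative to worker startup, the smaller the backlog (and the peak memory held by the queue) can grow.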

With this change, memory usage is significantly reduced compared to the previous implementation.

Here is the memory usage obtained with the same operation as in #4994.

FYI: You can easily confirm the memory usage with the following sample script.

queue = Thread::Queue.new

# Fill the queue first, then drain it afterwards, mimicking a router that only
# starts consuming after the worker has started.
500_000.times do
  queue << "a" * 1024
end

until queue.empty?
  queue.pop
end

GC.start

# Resident set size of this process, in MB.
rss = Integer(`ps -o rss= -p #{Process.pid}`) / 1024.0
puts "Memory Usage: #{rss} MB"

$ ruby t.rb
Memory Usage: 539.25 MB
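
For comparison, draining the queue concurrently while it is still being filled, which is the behavior this PR aims for, keeps the resident set much smaller. A minimal sketch of the same measurement with a consumer thread (the exact number will vary by environment):

queue = Thread::Queue.new

# The consumer drains the queue while the producer is still pushing,
# so items never accumulate.
consumer = Thread.new do
  loop do
    item = queue.pop
    break if item == :stop
  end
end

500_000.times { queue << "a" * 1024 }
queue << :stop
consumer.join

GC.start

rss = Integer(`ps -o rss= -p #{Process.pid}`) / 1024.0
puts "Memory Usage: #{rss} MB"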

Docs Changes:

Release Note:

@Watson1978 Watson1978 marked this pull request as ready for review June 5, 2025 09:55
@Watson1978 Watson1978 requested review from kenhys and daipom June 5, 2025 09:56
@daipom (Contributor) commented Jun 5, 2025

Thanks!
I see. Actually, in_tail starts reading logs before completing its start phase unless skip_refresh_on_startup is set.
(This may not be preferable...)

However, it appears to me that we have to be very careful about changing this starting order.
Since FluentLogEventRouter is related to event routing, it would be safe to start it after all plugins have been started.

For example, it would cause a bug if events were emitted before some plugins had started.
(in_tail does this, but fortunately the input plugins start last, so it likely hasn't caused any issues.)
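
For reference, skip_refresh_on_startup is an in_tail parameter; a minimal configuration sketch (path, pos_file, and tag are placeholders):

<source>
  @type tail
  path /var/log/app/*.log               # placeholder path
  pos_file /var/log/fluentd/app.pos     # placeholder position file
  tag app.logs
  # Defer the initial refresh_watchers so reading does not begin during startup;
  # watchers are registered by the periodic refresh instead.
  skip_refresh_on_startup true
  <parse>
    @type none
  </parse>
</source>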

@ashie (Member) commented Jun 6, 2025

It might be better to implement #1504 instead of this change.
Running refresh_watchers for a huge amount of logs during the start phase causes other problems too; for example, it blocks the startup of all other plugins and of Fluentd itself until it finishes.

@Watson1978 (Contributor, Author) commented

> it would be safe to start it after all plugins have been started.

I don't understand how the current behavior can be considered safe when it is already known to cause memory leaks.

@daipom (Contributor) commented Jun 6, 2025

> > it would be safe to start it after all plugins have been started.
>
> I don't understand how the current behavior can be considered safe when it is already known to cause memory leaks.

I don't mean to say the current code is entirely safe.
I mean that starting FluentLogEventRouter after all plugins have been started is safe for the event routing process.

@daipom (Contributor) commented Jun 6, 2025

> It might be better to implement #1504 instead of this change. Running refresh_watchers for a huge amount of logs during the start phase causes other problems too; for example, it blocks the startup of all other plugins and of Fluentd itself until it finishes.

I see!
It would not be a fundamental solution, but it would be very beneficial!

@Watson1978 (Contributor, Author) commented

> For example, it would cause a bug if events were emitted before some plugins had started.

I think FluentLogEventRouter handles routing of Fluentd's own log events.
Logs should only be generated by plugins that are already running; if a plugin emits logs before it has started, that is what should be fixed.

@Watson1978 Watson1978 closed this Jun 9, 2025
@Watson1978 Watson1978 deleted the engine branch June 9, 2025 01:20
@Watson1978 (Contributor, Author) commented

We need another approach to fix this.

Development

Successfully merging this pull request may close these issues.

memory leak with pattern not matched error