RabbitMQ docker image takes more time to start when using home dir in an EFS mount #471

rahulsadanandan · 2021-02-09T09:46:45Z

rahulsadanandan
Feb 9, 2021

Describe the bug
The RabbitMQ Docker image takes a long time to start (5-10 minutes) when using an EFS mount as the home directory location (/var/lib/rabbitmq).

To Reproduce
Steps to reproduce the behavior:

Create an EFS file system.
Create an EC2 instance that mounts the EFS file system at a path (/mnt/efs/fs1).
Run without using an EFS mount:
docker run --hostname my-rabbit --name some-rabbit rabbitmq:3.8.11
Starts normally. Logs attached, with log.file.level = debug:
without_efs.log
Run using an EFS mount for the home directory:
docker run -v /mnt/efs/fs1:/var/lib/rabbitmq --hostname my-rabbit --name some-rabbit rabbitmq:3.8.11
Takes much longer to start. Logs attached, with log.file.level = debug:
efs.log

Additional Information
This is also reproducible in Kubernetes (using the Bitnami RabbitMQ chart):

bitnami/charts#4936

wglambert · 2021-02-09T16:36:55Z

wglambert
Feb 9, 2021

The bitnami chart/image isn't derived from this image: https://github.com/bitnami/bitnami-docker-rabbitmq/blob/master/3.8/debian-10/Dockerfile

The discussion in that thread is pretty informative for troubleshooting the issue, notably bitnami/charts#4936 (comment) and michaelklishin's comment about using strace. So I'm not sure if there's anything more we can add to that in a separate thread here.

0 replies

rahulsadanandan · 2021-02-23T04:13:51Z

rahulsadanandan
Feb 23, 2021
Author

@michaelklishin @wglambert
This issue happens with the official RabbitMQ image (https://hub.docker.com/_/rabbitmq).

I used debug logging and strace, and here are my findings. Please have a look when you have some time.

While it's stuck for 7 minutes at:

[root@ip-172-31-27-174 fs1]# docker run --cap-add SYS_PTRACE -v /mnt/efs/fs1:/var/lib/rabbitmq -v /mnt/efs/fs1/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf --hostname my-rabbit --name some-rabbit rabbitmq:3.8.11
Unable to find image 'rabbitmq:3.8.11' locally
3.8.11: Pulling from library/rabbitmq
d519e2592276: Pull complete
d22d2dfcfa9c: Pull complete
b3afe92c540b: Pull complete
cd4e41ce9500: Pull complete
e2741828ce46: Pull complete
6cf1935b659a: Pull complete
3df71d67553c: Pull complete
ac4f52d15541: Pull complete
0af823fd61c8: Pull complete
85579530757b: Pull complete
Digest: sha256:52e73c649b3ef628fb2b0dafd5b043c0b397bd188a0326a6514d37662d84b425
Status: Downloaded newer image for rabbitmq:3.8.11
Configuring logger redirection

Strace outputs

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
rabbitmq     1  0.2  0.0   4636   888 ?        Ss   03:55   0:00 /bin/sh /opt/rabbitmq/sbin/rabbitmq-server
rabbitmq    16  2.7  4.9 1686864 50052 ?       Sl   03:55   0:02 /usr/local/lib/erlang/erts-11.1.7/bin/beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576
rabbitmq    23  0.0  0.0   4528   880 ?        Ss   03:55   0:00 erl_child_setup 1024
rabbitmq    48  0.0  0.0   8280    88 ?        S    03:55   0:00 /usr/local/lib/erlang/erts-11.1.7/bin/epmd -daemon
rabbitmq    68  0.0  0.1   8272  1180 ?        Ss   03:55   0:00 inet_gethost 4
rabbitmq    69  0.0  0.1  10392  1716 ?        S    03:55   0:00 inet_gethost 4
root        70  0.5  0.3  20264  3840 pts/0    Ss   03:56   0:00 bash
root       330  0.0  0.3  36160  3284 pts/0    R+   03:57   0:00 ps -aux


root@my-rabbit:/# strace -p 1
strace: Process 1 attached
rt_sigsuspend([], 8


root@my-rabbit:/# strace -p 16
strace: Process 16 attached
select(0, NULL, NULL, NULL, NULL


root@my-rabbit:/# strace -p 23
strace: Process 23 attached
select(5, [3 4], NULL, NULL, NULL


root@my-rabbit:/# strace -p 48
strace: Process 48 attached
select(7, [3 4 5], NULL, NULL, {tv_sec=0, tv_usec=147862}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
select(7, [3 4 5], NULL, NULL, {tv_sec=5, tv_usec=0}

cc @chukka @eldada

0 replies

yosifkit · 2021-02-24T21:57:24Z

yosifkit
Feb 24, 2021
Maintainer

Sounds similar to helm/charts#1711 (so EFS is just NFS behind the scenes?).

Although it is about Elasticsearch, this post seems relevant, since RabbitMQ likely also cares about filesystem performance:

EFS-based storage is not recommended or supported as it does not offer satisfactory performance. Historically, shared network filesystems such as EFS have not always offered precisely the behaviour that Elasticsearch requires of its filesystem

0 replies

debu99 · 2021-03-16T07:41:17Z

debu99
Mar 16, 2021

I have the same issue with rabbitmq:latest.
rabbitmq:3.6 and 3.7 work.

0 replies

msolberg8 · 2021-04-16T13:55:03Z

msolberg8
Apr 16, 2021

I'm able to run RabbitMQ 3.8.11 just fine persisting to our internal NFS, but when EFS gets involved, startup times turn nasty. The interesting thing, though, is that I see fine runtime performance. @michaelklishin would you have any ideas?

Just a note: I spent a significant amount of time with AWS support before we found this issue, and we can confirm that EFS is functioning just fine. It just seems that for some reason Rabbit won't write to it very quickly.

Watching the EFS during startup, it seems to be writing 400 MB of quorum queue data very slowly. We're not using this feature yet, so I'm not too familiar with it, but this happens on every startup: it deletes the data and rewrites it.

1 reply

markusschaber Jul 6, 2021

We once had Git repositories on EFS at a prototype stage, and it was very slow. We found that while writing huge files is very fast, writing lots of small files, especially with locking and flushing involved, is very slow. So if Rabbit also uses locks and flushes for the quorum queue data, this might explain why it's slow.

rahulsadanandan · 2021-07-12T07:29:46Z

rahulsadanandan
Jul 12, 2021
Author

Hi team,
We were tracking the same issue with AWS support. They were able to replicate the issue in their lab environment and came back with the response below:

We have investigated your workload (this is my lab environment where I have replicated your use case) during the period of June 24th 19:00 UTC to 20:00 UTC. We noticed that the message broker service, RabbitMQ, performs small-sized (less than 4kb) writes to a single file at a time.
These are slot-conditional partial page writes meaning the page needs to be read before being written to, which adds latency.
When the application is starting up, it is performing these writes multiple times. Of the NFS operations that take over 100ms, over 99.99% are small writes.
In order to improve the startup time, we are wondering if you can distribute your writes across multiple threads and perform larger writes.

We will test further with higher throughput and operations per second on the EFS side.
Any other thoughts would be really helpful, @michaelklishin @yosifkit.
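
For anyone who wants to check the "many small synchronous writes vs. one large write" observation on their own mount, here is a rough sketch. It assumes GNU dd and GNU date are available; TARGET_DIR is a hypothetical variable (not from this thread) that defaults to a temporary directory and should be pointed at the EFS mount to be meaningful. It does not reproduce RabbitMQ's exact write pattern; it only compares flushed 4 KiB writes against a single flushed 1 MiB write of the same total size.

```shell
#!/bin/sh
# Sketch only: compare many small flushed writes with one large flushed write.
# TARGET_DIR is a hypothetical variable; set it to the mount under test,
# e.g. TARGET_DIR=/mnt/efs/fs1. It defaults to a temporary directory.
TARGET_DIR=${TARGET_DIR:-$(mktemp -d)}

# 256 writes of 4 KiB each (1 MiB total); oflag=dsync flushes every write.
start=$(date +%s%N)
dd if=/dev/zero of="$TARGET_DIR/small_writes.bin" bs=4k count=256 oflag=dsync 2>/dev/null
small_ms=$(( ($(date +%s%N) - start) / 1000000 ))

# A single flushed write of 1 MiB.
start=$(date +%s%N)
dd if=/dev/zero of="$TARGET_DIR/large_write.bin" bs=1M count=1 oflag=dsync 2>/dev/null
large_ms=$(( ($(date +%s%N) - start) / 1000000 ))

echo "256 x 4 KiB synced writes: ${small_ms} ms"
echo "1 x 1 MiB synced write:    ${large_ms} ms"
```

Running it once against local disk and once against the EFS path gives a quick, if crude, comparison of how each handles small flushed writes.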

0 replies

rahulsadanandan · 2021-07-27T14:53:00Z

rahulsadanandan
Jul 27, 2021
Author

After reducing raft.wal_max_size_bytes to 1 MB, we saw that RabbitMQ boots up quickly, as normal.
Just wanted to confirm here: if we are not using quorum queues, is it okay to set this value to 1 MB?

1 reply

michaelklishin Jul 27, 2021
Collaborator

It is reasonable to decrease it as it is 512 MiB or so by default. When you do adopt quorum queues eventually (I assume you will use a replicated queue type at some point and classic queue mirroring is going away in a future version), you'd have to revisit this change. A 16-32 MiB value can still be perfectly sufficient for many workloads.
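
As an illustration of the setting discussed here, the following sketch generates a rabbitmq.conf that lowers raft.wal_max_size_bytes. The variable names (WAL_MIB, WAL_BYTES, CONF_DIR) are hypothetical, and 32 MiB is just one value from the range mentioned above, not a recommendation for any particular workload. The docker command is left as a comment and reuses the mount paths from the reproduction earlier in the thread.

```shell
#!/bin/sh
# Sketch only: write a rabbitmq.conf that lowers the Raft WAL size limit.
# WAL_MIB is a hypothetical variable; 32 is one value from the 16-32 MiB range above.
WAL_MIB=32
WAL_BYTES=$((WAL_MIB * 1024 * 1024))

# Write the config into a temporary directory (CONF_DIR is hypothetical too).
CONF_DIR=$(mktemp -d)
cat > "$CONF_DIR/rabbitmq.conf" <<EOF
# Maximum size of the quorum queue (Raft) write-ahead log, in bytes
raft.wal_max_size_bytes = $WAL_BYTES
EOF

# Show the generated file.
cat "$CONF_DIR/rabbitmq.conf"

# To try it, mount the file the same way as in the reproduction earlier:
# docker run -v /mnt/efs/fs1:/var/lib/rabbitmq \
#   -v "$CONF_DIR/rabbitmq.conf":/etc/rabbitmq/rabbitmq.conf \
#   --hostname my-rabbit --name some-rabbit rabbitmq:3.8.11
```

For 32 MiB the generated value is 33554432 bytes. As noted above, the value would need revisiting once quorum queues are adopted.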

luispabon · 2024-05-03T09:56:02Z

luispabon
May 3, 2024

Was this issue ever fixed? I'm seeing the same problem but with EBS / ext4 storage, not EFS.

3 replies

lukebakken May 3, 2024
Collaborator

Rather than commenting on a closed discussion, start a new one, and provide much more detail than "I'm seeing the same problem". Be sure to read the comments here first; they are still relevant (#471 (comment)).

luispabon May 3, 2024

How is this a closed discussion? I'm adding more detail to this one as well but checking if anyone here knows if the original problem was fixed. If anyone knows, they'd be here. The symptoms are the same - it hangs on "initialising in the background" for about 2 minutes, just like the original reports.

lukebakken May 3, 2024
Collaborator

How is this a closed discussion?

Because nobody provided any further detail, or offered to provide an environment for investigation 🤷‍♂️

Many, many production environments use this RabbitMQ docker image (including those on Amazon AWS using EBS), yet only a few people have commented here, so it can't be considered a priority.
