Inconsistent message queue handling between roscpp and rospy #1901

Open
sfalexrog opened this issue Feb 27, 2020 · 5 comments · May be fixed by #2170

@sfalexrog

(Started at CopterExpress/clover#218)

The problem

While writing some image processing code in Python, we noticed that if the processing runs inside the callback and takes a while to complete, latency increases significantly over time. This happens even with queue_size=1 set on both the publisher and the subscriber, and is only partially affected by buff_size.

In order to reproduce the problem, I did the following:

  • wrote a simple node that subscribed to a message, looked at the difference between rospy.Time.now() and the message timestamp, and blocked the callback thread for one second (using rospy.sleep());
  • wrote the same code in C++ (blocking the thread with std::this_thread::sleep_for);
  • looked at the resulting differences.

Typical message "staleness" (the difference between the moment when the callback was called and the message timestamp) was close to the time between two published messages for the C++ variant; in Python, however, it was closer to the time it took for the callback to finish.
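For reference, a minimal sketch of the Python subscriber used for this kind of measurement (topic name, message type and the 1-second delay are illustrative, not the exact test code):

```python
#!/usr/bin/env python
# Minimal staleness-measuring subscriber (illustrative sketch).
import rospy
from sensor_msgs.msg import Image

def callback(msg):
    # "Staleness": how old the message already is when the callback starts.
    staleness = (rospy.Time.now() - msg.header.stamp).to_sec()
    rospy.loginfo("staleness: %.3f s", staleness)
    rospy.sleep(1.0)  # simulate a long-running callback

rospy.init_node('staleness_py')
rospy.Subscriber('image', Image, callback, queue_size=1)
rospy.spin()
```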

Test setup

I've tried to write a simple test to determine the staleness for different setups; the code is available here: https://github.com/sfalexrog/staleness_test. I also tried running the code in Travis: https://travis-ci.com/sfalexrog/staleness_test/.

There are three variants of publishing nodes (nodelet - a C++ nodelet; node_cpp - a C++ node; node_py - a Python node) and three variants of subscribing nodes (with the same names). Additionally, there is a variant of node_py, called node_py_largebuf, which is the same as node_py but with buff_size set to approximately the total size of the messages generated by the publisher during one callback.

Publishers simulate a camera that generates 640x480 RGB8 images at 100 Hz. Subscribers simulate image processing nodes with long-running code (the message callback blocks for 1 second). Publishing and subscribing nodelets use the same nodelet manager.
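The publisher side can be approximated by something like the following sketch (topic name is illustrative):

```python
#!/usr/bin/env python
# Simulated camera: 640x480 RGB8 images at 100 Hz (illustrative sketch).
import rospy
from sensor_msgs.msg import Image

rospy.init_node('fake_camera')
pub = rospy.Publisher('image', Image, queue_size=1)

msg = Image()
msg.width, msg.height = 640, 480
msg.encoding = 'rgb8'
msg.step = msg.width * 3
msg.data = b'\x00' * (msg.step * msg.height)  # ~900 KB per message

rate = rospy.Rate(100)
while not rospy.is_shutdown():
    msg.header.stamp = rospy.Time.now()
    pub.publish(msg)
    rate.sleep()
```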

Test results

Below are the average results from running on Kinetic (https://travis-ci.com/sfalexrog/staleness_test/jobs/291735129); column names describe the publisher type, row names describe the subscriber type:

| subscriber \ publisher | nodelet | node_cpp | node_py |
| --- | --- | --- | --- |
| nodelet | 0.000253 | 0.005502 | 0.006024 |
| node_cpp | 0.005065 | 0.00517571 | 0.005521 |
| node_py | 8.126688 | 8.159472 | 8.077343 |
| node_py_largebuf | 0.944017 | 0.944568 | 0.938216 |

Melodic results are similar: https://travis-ci.com/sfalexrog/staleness_test/jobs/291735130

Questions

Is this behavior expected or documented? I could not find a warning about long-running subscribers in rospy, and the behavior is inconsistent with roscpp. We can work around this problem by creating a separate thread for long-running tasks, but this makes the code more complicated, and we'd prefer a simpler, more reliable Python API.
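The workaround looks roughly like this: keep the callback trivially cheap (it only stores the newest message) and do the heavy work in a separate thread. This is only a sketch, with the topic name, buff_size value and the 1-second processing stand-in chosen for illustration:

```python
#!/usr/bin/env python
# Workaround sketch: cheap callback + worker thread processing only the latest message.
import threading
import rospy
from sensor_msgs.msg import Image

latest = None
lock = threading.Lock()
new_data = threading.Event()

def callback(msg):
    global latest
    with lock:
        latest = msg      # overwrite: we only care about the newest message
    new_data.set()

def worker():
    while not rospy.is_shutdown():
        if not new_data.wait(timeout=0.1):
            continue
        new_data.clear()
        with lock:
            msg = latest
        rospy.loginfo("processing message stamped %.3f", msg.header.stamp.to_sec())
        rospy.sleep(1.0)  # stand-in for the long-running image processing

rospy.init_node('worker_thread_subscriber')
rospy.Subscriber('image', Image, callback, queue_size=1, buff_size=2**24)
t = threading.Thread(target=worker)
t.daemon = True
t.start()
rospy.spin()
```

Because the callback returns immediately, the receive loop keeps draining the socket and queue_size=1 behaves as expected; the price is extra synchronization code in every such node.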

@mikepurvis
Member

Unfortunately, there are a lot of warts about the rospy implementation, in particular the baffling thread model and lack of user control over what is scheduled and where.

This is not particularly well documented and is very unlikely to change at this point; the path forward is to migrate to ROS 2, which is built around a common client library exposing a C interface, with language-specific bindings on top of it forming the basis of each client library.

This design approach is in part in response to these criticisms of ROS 1. It also makes supporting a new language easier in terms of getting broad API coverage right out of the gate.

@sfalexrog
Author

@mikepurvis,
Thank you for your suggestion. Unfortunately, there are still a number of packages not readily available for ROS2 (although ros1_bridge might help with that). The rclpy API is also more explicit, which is a mixed blessing: for instance, service calls no longer look like function calls, results are passed as futures (there is a synchronous call, but it looks like it may be broken in Eloquent), and you can't create a Request implicitly. In my opinion, ROS2 is not as mature for educational purposes as ROS1 is (although I would be happy to be proven wrong!).
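For illustration, an asynchronous service call in rclpy looks roughly like this (node name, service name and the AddTwoInts type are just examples):

```python
# Illustrative rclpy service call: explicit Request object, result as a future.
import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts

rclpy.init()
node = Node('client_example')
client = node.create_client(AddTwoInts, 'add_two_ints')
client.wait_for_service()

request = AddTwoInts.Request()          # the request is created explicitly
request.a, request.b = 2, 3

future = client.call_async(request)     # the result comes back as a future
rclpy.spin_until_future_complete(node, future)
print(future.result().sum)

node.destroy_node()
rclpy.shutdown()
```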

As for this issue, I believe @okalachev has suggested some ways to fix this rospy behavior in the original issue: CopterExpress/clover#218 (comment), CopterExpress/clover#218 (comment). Is there a chance something like that could be done for rospy?

@mikepurvis
Member

Looks like the suggestion is to create another internal thread for each subscriber, exclusively dedicated to reading off the socket, so that reading is never blocked by long callbacks and the kernel buffer never fills up, which would result in loss of the newest data.

This does seem pretty reasonable, and doesn't change any of the semantics of the rospy API. It would be unlikely to break any user code.

If it were possible to implement this without a major overhaul of rospy and it could be ready in the next few weeks (e.g. ahead of the release of Noetic), I would consider it. @dirk-thomas, any further thoughts?

@slavanap

Is it possible to just increase socket system buffer for reading?
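As far as I know, rospy only exposes its own user-space read buffer through the buff_size parameter; the kernel-level receive buffer is SO_RCVBUF, which is not a rospy option. Plain-socket code for setting it would look like this (purely for reference, not rospy API):

```python
# What "increasing the socket system buffer" means at the OS level (not a rospy API).
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)  # request ~4 MB
# The kernel may clamp the value (e.g. to net.core.rmem_max on Linux); check what was granted:
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```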

c-andy-martin pushed a commit to BadgerTechnologies/ros_comm that referenced this issue Jul 9, 2021
Long-running callbacks in rospy can cause extreme amounts of
buffering resulting in unnecessary delay, essentially ignoring the
queue_size setting. This can already be somewhat mitigated by setting
buff_size to be larger than the amount of data that could be buffered
by a long running callback. However, setting buff_size to a correct
value is not possible for the user of the API if the amount of time
in the callback or the amount of data that would be transmitted is
unknown.

Greatly mitigate the delays in such cases by altering the structure
of the receive logic. Instead of recv()ing up to buff_size data, then
calling the callbacks on every message received, interleave calling
recv() between each callback, enforcing queue_size as we go. Also,
recv() all data currently available when calling recv() by calling
recv() non-blocking after calling it blocking. While it is still
possible to have stale data, even with a queue_size of 1, it is less
likely, especially if the publisher of the data is on the same host.
Even if not, the staleness of the data with a queue_size of 1 is now
bounded by the runtime of the callback instead of by buff_size.

This mitigation was chosen over a complete fix to the problem
because a complete fix would involve a new thread to handle
callbacks. While a new thread would allow recv() to be running
all the time, even during the long callback, it is a more complex
solution. Since rospy is going to be replaced in ROS2, this more
tactical mitigation seems appropriate.

This mitigates ros#1901
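The restructured receive logic described above, in very rough pseudocode (recv_blocking and recv_nonblocking are hypothetical helpers standing in for the real socket handling; this is not the actual patch):

```python
# Sketch of the interleaved-recv mitigation: drain the socket between callbacks
# and keep at most queue_size messages, dropping the oldest ones.
from collections import deque

def receive_loop(recv_blocking, recv_nonblocking, callback, queue_size):
    queue = deque(maxlen=queue_size)      # oldest messages fall off the left end
    while True:
        queue.extend(recv_blocking())     # wait until at least one message arrives
        while queue:
            # Before running the (possibly slow) callback, also grab anything that
            # has already arrived, enforcing queue_size as we go.
            queue.extend(recv_nonblocking())
            callback(queue.popleft())
```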
c-andy-martin pushed a commit to BadgerTechnologies/ros_comm that referenced this issue Jul 13, 2021
Long-running callbacks in rospy can cause extreme amounts of buffering
resulting in unnecessary delay, essentially ignoring the queue_size
setting (ros#1901). This can already be somewhat mitigated by setting
buff_size to be larger than the amount of data that could be buffered
by a long running callback. However, setting buff_size to a correct
value is not possible for the user of the API if the amount of time in
the callback or the amount of data that would be transmitted is
unknown. Also, even with a correct buff_size and a queue_size of 1, the
received data may still be the oldest of all data transmitted while the
callback was running.

Fix the delays in such cases by running callbacks in a separate thread.
The receive_loop then calls recv() concurrently with the long running
callback, enforcing queue_size as new data is received. This fixes the
latency in the data when queue_size is set to be similar to roscpp.

This fixes ros#1901
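The callback-thread approach, reduced to a rough sketch (this is not the actual code from the linked PR; names are illustrative):

```python
# Sketch: callbacks run in their own thread, so recv() in the receive loop is
# never blocked, and the bounded deque enforces queue_size by dropping old messages.
import threading
from collections import deque

class CallbackDispatcher(object):
    def __init__(self, callback, queue_size):
        self._queue = deque(maxlen=queue_size)   # drops the oldest message when full
        self._cond = threading.Condition()
        self._callback = callback
        t = threading.Thread(target=self._run)
        t.daemon = True
        t.start()

    def push(self, msg):
        # Called from the receive loop; never blocks on the user callback.
        with self._cond:
            self._queue.append(msg)
            self._cond.notify()

    def _run(self):
        while True:
            with self._cond:
                while not self._queue:
                    self._cond.wait()
                msg = self._queue.popleft()
            self._callback(msg)                  # long-running work happens here
```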
@c-andy-martin linked a pull request Jul 14, 2021 that will close this issue
@c-andy-martin
Contributor

I ran into this problem myself and have coded up a fix; see PR #2170.
