Inconsistent message queue handling between roscpp and rospy #1901

Open
sfalexrog opened this issue Feb 27, 2020 · 5 comments · May be fixed by #2170

@sfalexrog

(Started at CopterExpress/clover#218)

The problem

While writing some image processing code in Python, we noticed that if the processing runs inside the callback and takes a while to complete, latency increases significantly over time. This happens even with queue_size=1 set on both the publisher and the subscriber, and is only partially affected by buff_size.

In order to reproduce the problem, I did the following:

  • wrote a simple node that subscribed to a message, looked at the difference between rospy.Time.now() and the message timestamp, and blocked the callback thread for one second (using rospy.sleep());
  • wrote the same code in C++ (blocking the thread with std::this_thread::sleep_for);
  • looked at the resulting differences.

Typical message "staleness" (the difference between the moment when the callback was called and the message timestamp) was close to the time between two published messages for the C++ variant; in Python, however, it was closer to the time it took for the callback to finish.
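For reference, a minimal sketch of the Python subscriber used for this kind of measurement (topic name, message type and the 1-second delay are illustrative, not the exact test code):

```python
#!/usr/bin/env python
# Minimal staleness-measuring subscriber (illustrative sketch).
import rospy
from sensor_msgs.msg import Image

def callback(msg):
    # "Staleness": how old the message already is when the callback starts.
    staleness = (rospy.Time.now() - msg.header.stamp).to_sec()
    rospy.loginfo("staleness: %.3f s", staleness)
    rospy.sleep(1.0)  # simulate a long-running callback

rospy.init_node('staleness_py')
rospy.Subscriber('image', Image, callback, queue_size=1)
rospy.spin()
```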

Test setup

I've tried to write a simple test to determine the staleness for different setups; the code is available here: https://github.com/sfalexrog/staleness_test. I also tried running the code in Travis: https://travis-ci.com/sfalexrog/staleness_test/.

There are three variants of publishing nodes (nodelet - a C++ nodelet; node_cpp - a C++ node; node_py - a Python node) and three variants of subscribing nodes (with the same names). Additionally, there is a variant of node_py, called node_py_largebuf, which is the same as node_py but with buff_size set to approximately the total size of the messages generated by the publisher during one callback.

Publishers simulate a camera that generates 640x480 RGB8 images at 100 Hz. Subscribers simulate image processing nodes with long-running code (the message callback blocks for 1 second). Publishing and subscribing nodelets use the same nodelet manager.
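The publisher side can be approximated by something like the following sketch (topic name is illustrative):

```python
#!/usr/bin/env python
# Simulated camera: 640x480 RGB8 images at 100 Hz (illustrative sketch).
import rospy
from sensor_msgs.msg import Image

rospy.init_node('fake_camera')
pub = rospy.Publisher('image', Image, queue_size=1)

msg = Image()
msg.width, msg.height = 640, 480
msg.encoding = 'rgb8'
msg.step = msg.width * 3
msg.data = b'\x00' * (msg.step * msg.height)  # ~900 KB per message

rate = rospy.Rate(100)
while not rospy.is_shutdown():
    msg.header.stamp = rospy.Time.now()
    pub.publish(msg)
    rate.sleep()
```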

Test results

Below are the average results from running on Kinetic (https://travis-ci.com/sfalexrog/staleness_test/jobs/291735129); column names describe the publisher type, row names describe the subscriber type:

| subscriber \ publisher | nodelet | node_cpp | node_py |
| --- | --- | --- | --- |
| nodelet | 0.000253 | 0.005502 | 0.006024 |
| node_cpp | 0.005065 | 0.00517571 | 0.005521 |
| node_py | 8.126688 | 8.159472 | 8.077343 |
| node_py_largebuf | 0.944017 | 0.944568 | 0.938216 |

Melodic results are similar: https://travis-ci.com/sfalexrog/staleness_test/jobs/291735130

Questions

Is this behavior expected or documented? I could not find a warning about long-running subscribers in rospy, and the behavior is inconsistent with roscpp. We can work around this problem by creating a separate thread for long-running tasks, but this makes the code more complicated, and we'd prefer a simpler, more reliable Python API.
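The workaround looks roughly like this: keep the callback trivially cheap (it only stores the newest message) and do the heavy work in a separate thread. This is only a sketch, with the topic name, buff_size value and the 1-second processing stand-in chosen for illustration:

```python
#!/usr/bin/env python
# Workaround sketch: cheap callback + worker thread processing only the latest message.
import threading
import rospy
from sensor_msgs.msg import Image

latest = None
lock = threading.Lock()
new_data = threading.Event()

def callback(msg):
    global latest
    with lock:
        latest = msg      # overwrite: we only care about the newest message
    new_data.set()

def worker():
    while not rospy.is_shutdown():
        if not new_data.wait(timeout=0.1):
            continue
        new_data.clear()
        with lock:
            msg = latest
        rospy.loginfo("processing message stamped %.3f", msg.header.stamp.to_sec())
        rospy.sleep(1.0)  # stand-in for the long-running image processing

rospy.init_node('worker_thread_subscriber')
rospy.Subscriber('image', Image, callback, queue_size=1, buff_size=2**24)
t = threading.Thread(target=worker)
t.daemon = True
t.start()
rospy.spin()
```

Because the callback returns immediately, the receive loop keeps draining the socket and queue_size=1 behaves as expected; the price is extra synchronization code in every such node.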

@mikepurvis
Member

Unfortunately, there are a lot of warts about the rospy implementation, in particular the baffling thread model and lack of user control over what is scheduled and where.

This is not particularly well documented and is very unlikely to change at this point; the path forward is to migrate to ROS 2, which is built around a common client library exposing a C interface, with language-specific bindings on top of it forming the basis of each client library.

This design approach is in part in response to these criticisms of ROS 1. It also makes supporting a new language easier in terms of getting broad API coverage right out of the gate.

@sfalexrog
Author

@mikepurvis,
Thank you for your suggestion. Unfortunately, there are still a number of packages not readily available for ROS2 (although ros1_bridge might help with that). The rclpy API is also more explicit, which is a mixed blessing: for instance, service calls no longer look like function calls, results are passed as futures (there is a synchronous call, but it looks like it may be broken in Eloquent), and you can't create a Request implicitly. In my opinion, ROS2 is not as mature for educational purposes as ROS1 is (although I would be happy to be proven wrong!).
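For illustration, an asynchronous service call in rclpy looks roughly like this (node name, service name and the AddTwoInts type are just examples):

```python
# Illustrative rclpy service call: explicit Request object, result as a future.
import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts

rclpy.init()
node = Node('client_example')
client = node.create_client(AddTwoInts, 'add_two_ints')
client.wait_for_service()

request = AddTwoInts.Request()          # the request is created explicitly
request.a, request.b = 2, 3

future = client.call_async(request)     # the result comes back as a future
rclpy.spin_until_future_complete(node, future)
print(future.result().sum)

node.destroy_node()
rclpy.shutdown()
```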

As for this issue, I believe @okalachev has suggested some ways to fix this rospy behavior in the original issue: CopterExpress/clover#218 (comment), CopterExpress/clover#218 (comment). Is there a chance something like that could be done for rospy?

@mikepurvis
Member

Looks like the suggestion is to create another internal thread for each subscriber, exclusively dedicated to reading off the socket, so that reading is never blocked by long callbacks and the kernel buffer never fills up, which would result in loss of the newest data.

This does seem pretty reasonable, and doesn't change any of the semantics of the rospy API. It would be unlikely to break any user code.

If it were possible to implement this without a major overhaul of rospy and it could be ready in the next few weeks (e.g. ahead of the release of Noetic), I would consider it. @dirk-thomas, any further thoughts?

@slavanap

Is it possible to just increase socket system buffer for reading?
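As far as I know, rospy only exposes its own user-space read buffer through the buff_size parameter; the kernel-level receive buffer is SO_RCVBUF, which is not a rospy option. Plain-socket code for setting it would look like this (purely for reference, not rospy API):

```python
# What "increasing the socket system buffer" means at the OS level (not a rospy API).
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)  # request ~4 MB
# The kernel may clamp the value (e.g. to net.core.rmem_max on Linux); check what was granted:
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```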

c-andy-martin pushed a commit to BadgerTechnologies/ros_comm that referenced this issue Jul 9, 2021
Long-running callbacks in rospy can cause extreme amounts of
buffering resulting in unnecessary delay, essentially ignoring the
queue_size setting. This can already be somewhat mitigated by setting
buff_size to be larger than the amount of data that could be buffered
by a long running callback. However, setting buff_size to a correct
value is not possible for the user of the API if the amount of time
in the callback or the amount of data that would be transmitted is
unknown.

Greatly mitigate the delays in such cases by altering the structure
of the receive logic. Instead of recv()ing up to buff_size data, then
calling the callbacks on every message received, interleave calling
recv() between each callback, enforcing queue_size as we go. Also,
recv() all data currently available when calling recv() by calling
recv() non-blocking after calling it blocking. While it is still
possible to have stale data, even with a queue_size of 1, it is less
likely, especially if the publisher of the data is on the same host.
Even if not, the staleness of the data with a queue_size of 1 is now
bounded by the runtime of the callback instead of by buff_size.

This mitigation was chosen over a complete fix to the problem
because a complete fix would involve a new thread to handle
callbacks. While a new thread would allow recv() to be running
all the time, even during the long callback, it is a more complex
solution. Since rospy is going to be replaced in ROS2, this more
tactical mitigation seems appropriate.

This mitigates ros#1901
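The restructured receive logic described above, in very rough pseudocode (recv_blocking and recv_nonblocking are hypothetical helpers standing in for the real socket handling; this is not the actual patch):

```python
# Sketch of the interleaved-recv mitigation: drain the socket between callbacks
# and keep at most queue_size messages, dropping the oldest ones.
from collections import deque

def receive_loop(recv_blocking, recv_nonblocking, callback, queue_size):
    queue = deque(maxlen=queue_size)      # oldest messages fall off the left end
    while True:
        queue.extend(recv_blocking())     # wait until at least one message arrives
        while queue:
            # Before running the (possibly slow) callback, also grab anything that
            # has already arrived, enforcing queue_size as we go.
            queue.extend(recv_nonblocking())
            callback(queue.popleft())
```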
c-andy-martin pushed a commit to BadgerTechnologies/ros_comm that referenced this issue Jul 13, 2021
Long-running callbacks in rospy can cause extreme amounts of buffering
resulting in unnecessary delay, essentially ignoring the queue_size
setting (ros#1901). This can already be somewhat mitigated by setting
buff_size to be larger than the amount of data that could be buffered
by a long running callback. However, setting buff_size to a correct
value is not possible for the user of the API if the amount of time in
the callback or the amount of data that would be transmitted is
unknown. Also, even with a correct buff_size and a queue_size of 1, the
received data may still be the oldest of all data transmitted while the
callback was running.

Fix the delays in such cases by running callbacks in a separate thread.
The receive_loop then calls recv() concurrently with the long running
callback, enforcing queue_size as new data is received. This fixes the
latency in the data when queue_size is set to be similar to roscpp.

This fixes ros#1901
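The callback-thread approach, reduced to a rough sketch (this is not the actual code from the linked PR; names are illustrative):

```python
# Sketch: callbacks run in their own thread, so recv() in the receive loop is
# never blocked, and the bounded deque enforces queue_size by dropping old messages.
import threading
from collections import deque

class CallbackDispatcher(object):
    def __init__(self, callback, queue_size):
        self._queue = deque(maxlen=queue_size)   # drops the oldest message when full
        self._cond = threading.Condition()
        self._callback = callback
        t = threading.Thread(target=self._run)
        t.daemon = True
        t.start()

    def push(self, msg):
        # Called from the receive loop; never blocks on the user callback.
        with self._cond:
            self._queue.append(msg)
            self._cond.notify()

    def _run(self):
        while True:
            with self._cond:
                while not self._queue:
                    self._cond.wait()
                msg = self._queue.popleft()
            self._callback(msg)                  # long-running work happens here
```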
@c-andy-martin linked a pull request Jul 14, 2021 that will close this issue
@c-andy-martin
Contributor

I ran into this problem myself and have coded up a fix; see PR #2170.
