Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault when calling rosbag2_transport::Recorder::stop() #1852

Open
rg-gravis opened this issue Nov 4, 2024 · 2 comments
Open

Segfault when calling rosbag2_transport::Recorder::stop() #1852

rg-gravis opened this issue Nov 4, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@rg-gravis
Copy link

Description

We occasionally have an issue where a rosbag2 Recorder node segfaults and dies when stop() is called. This ends up truncating the .mcap file being written, resulting in it being corrupted because it lacks trailing magic bytes.

Expected Behavior

No segfault, the bag is closed and the .mcap has its trailing bytes written.

Actual Behavior

The process dies with code -11. Logs:

[INFO] [1730717198.622841633] [main_bag_rosbag_recorder]: Listening for topics... [rosbag_run_recorder_node-9]
[INFO] [1730717198.622848700] [main_bag_rosbag_recorder]: Event publisher thread: Starting [rosbag_run_recorder_node-9]
[INFO] [1730717198.823090723] [main_bag_rosbag_recorder]: Subscribed to topic '/visualization/machine_interface/target_terrain' [rosbag_run_recorder_node-9]
....
[INFO] [1730717199.253250365] [main_bag_rosbag_recorder]: Overriding subscription profile for /rosout [rosbag_run_recorder_node-9]
[INFO] [1730717199.258032766] [main_bag_rosbag_recorder]: Subscribed to topic '/rosout' [rosbag_run_recorder_node-9]
[INFO] [1730717199.268296194] [main_bag_rosbag_recorder]: Recording... [rosbag_run_recorder_node-9]
...
Error:  [1730717288.011213136] [main_bag_rosbag_recorder]: discovery_future_.wait_for(100) return status: timeout [rosbag_run_recorder_node-9]
Error:  [rosbag_run_recorder_node-9]: process has died [pid 111, exit code -11, cmd '/home/dev_ws/install/rosbag_run_recorder/lib/rosbag_run_recorder/rosbag_run_recorder_node --ros-args --log-level INFO --ros-args'].

Logging from other nodes and many "subscribed to topic" messages have been excluded.

To Reproduce

We have not been able to consistently reproduce, and this error occurs only once out of every 50 or so calls to stop(). Contributing factors seem to be heavy CPU load and larger bags, and it does seem to be somewhat device-dependent. Unfortunately I cannot share the full bag, but its mcap summary is

library:   mcap go v1.4.1; libmcap 0.8.0                               
profile:   ros2                                                        
messages:  743886                                                      
duration:  1m28.990235383s                                             
start:     2024-11-04T11:46:38.837867611+01:00 (1730717198.837867611)  
end:       2024-11-04T11:48:07.828102994+01:00 (1730717287.828102994)  
compression:
	zstd: [36/36 chunks] [143.28 MiB/26.77 MiB (81.31%)] [308.08 KiB/sec] 

and it was recording on 59 topics.

We construct the node roughly as follows:

  rosbag2_storage::StorageOptions storage_options = {
      bagPath.string(), "mcap", 0, 0, 0,
      "zstd_fast",
      "", false};
  rosbag2_transport::RecordOptions record_options;
  record_options.topics = /* our own list of ~100 topics */;
  record_options.topic_qos_profile_overrides = {{"/rosout", rclcpp::RosoutQoS()}};
  record_options.rmw_serialization_format = std::string(rmw_get_serialization_format());

  auto writer = rosbag2_transport::ReaderWriterFactory::make_writer(record_options);

  return std::make_shared<rosbag2_transport::Recorder>(std::move(writer), std::make_shared<KeyboardHandler>(false), storage_options, record_options,"main_bag_rosbag_recorder");

and execute it with a MultiThreadedExecutor constructed using the default arguments and 4 threads (the executor is also running 2 other Recorder nodes but they record far less data and we haven't had any issues with them).

System (please complete the following information)

Additional context

This problem had also occurred when running via ros2 bag record, but there it manifested as the recorder apparently shutting down in response to SIGINT, but for some reason not closing the bag properly.

@rg-gravis rg-gravis added the bug Something isn't working label Nov 4, 2024
@rg-gravis
Copy link
Author

Finally managed to capture an instance of this in GDB:
image(2)

@MichaelOrlov
Copy link
Contributor

@rg-gravis Am I understanding correctly that segfault also reproduces if you are calling Recorder::stop() API directly without sending SIGINT signal? And this is on Humble right?
It would be more helpful for issue analysis to get a core dump after a segfault.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants