Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instrument client/service for end-to-end request/response tracking #145

Merged

Conversation

christophebedard
Copy link
Member

@christophebedard christophebedard commented Nov 17, 2024

Resolves #143

This adds support for instrumenting clients and services for end-to-end tracking of requests/responses. This means:

  1. Associating a request sent by a client to the same request received by a server
  2. Associating a response sent by a server to the original request sent by a client
  3. Associating a response sent by a server to the same response received by a client
    • There is a slight limitation here, see below

The above is achieved with a combination of multiple instrumentation points, mostly in the rmw implementations:

    • rmw_client_init: On client creation, collect the client's GID
    • rmw_send_request: On request publication, collect the client (indirectly corresponding to the client's GID) and the request's sequence number
    • rmw_take_request: On request take, collect the request's client GID and sequence number, which can be matched to the published request
    • rmw_send_response: On response publication, collect the client GID and the sequence number of the request that the response is for, which can be matched to the published/taken request
    • rmw_send_response: On response publication, collect the source timestamp of the response
    • rmw_take_response: On response take, collect the client GID and the sequence number of the request that the response is for and the source timestamp of the response, which can be matched to the published/taken request and published response
      • There is a slight limitation here, see below

Since clients (and services) are an rmw concept and not a DDS concept, selecting the "right" GID to use as the client GID and having it match when read from the client side and the service side was a bit tricky. Refer to the rmw implementation PRs (which need to be merged at the same time as this PR):

This updates some tracing tests in test_tracetools and adds a new test:

  1. test_client: basic validation of client tracepoints
  2. test_service: basic validation of service tracepoints
  3. test_service_req_resp: validate associations of requests/responses

This also updates the design document.


Limitation:

A specific request is, by definition, only sent by a single client and can be received by N>=1 services, but it's still the same request, so there is no issue there. However, there can be multiple responses to the same request, made by multiple services. While having multiple services with the same name is "generally discouraged," it is possible: https://github.com/ros2/rmw/blob/33118c9d4dc2adec838962554f0e09ab5c15d1e0/rmw/include/rmw/rmw.h#L1970-L1972.

To differentiate two service responses to the same request, and therefore know which response the client ends up using, we collect the response's source timestamp when the response is sent and when it is taken. Depending on the clock resolution, we could have collisions here if two services send a response for the same request at the exact same time, but it's unlikely. This is similar to what we do to match a message being published to the same message being received (see #74 and related PRs/issues). We would ideally instead collect the service GID + response sequence number on both sides. However, this is not available for service responses, and DDS-based GIDs are known to be problematic for some rmw implementations: ros2/rmw_cyclonedds#377. This is why we simply use source timestamps for pub/sub messages.


action-ros-ci-repos-override: https://gist.githubusercontent.com/christophebedard/cf4d163feeeabd2cd38cf694413127a0/raw/1fdc769133ff4c412b8283eebcca729f944f755c/ros2.repos

@christophebedard
Copy link
Member Author

All of this to say: would it be worth also collecting source timestamps of sent/taken responses to differentiate multiple responses to the same request?

I thought about it more and asked some people. I think that, even if it's "discouraged" or rare, because it can happen, it would be good to be able to differentiate multiple responses to the same request.

@christophebedard christophebedard force-pushed the christophebedard/instrument-services-end-to-end branch 2 times, most recently from c9240ff to 0d89ee1 Compare November 23, 2024 18:38
@christophebedard christophebedard marked this pull request as ready for review November 23, 2024 18:46
@christophebedard
Copy link
Member Author

I added the source timestamp to the response send/take instrumentation.

This is now ready for review.

christophebedard added a commit to christophebedard/org.eclipse.tracecompass.incubator that referenced this pull request Nov 24, 2024
This adds support for the new client/service instrumentation in ROS 2,
see ros2/ros2_tracing#145.

1. In the objects analysis, create client and service objects
2. In the messages analysis, create the following instances:
    1. Request publication
    2. Request take and callback
    3. Response publication
    4. Response take
    5. Message transport for requests and responses
3. In the messages dataprovider, display the above instances

Signed-off-by: Christophe Bedard <[email protected]>
christophebedard added a commit to christophebedard/org.eclipse.tracecompass.incubator that referenced this pull request Nov 24, 2024
This adds support for the new client/service (i.e., RPC) instrumentation
in ROS 2, see ros2/ros2_tracing#145.

1. In the objects analysis, create client and service objects
2. In the messages analysis, create the following instances:
    1. Request publication
    2. Request take and callback
    3. Response publication
    4. Response take
    5. Message transport for requests and responses
3. In the messages dataprovider, display the above instances

There is one limitation. Normal message publications and message takes
have instrumentation that provides a "start time." For example, for
message publications, the `ros2:rclcpp_publish` tracepoint is the start
and the `ros2:rmw_publish` tracepoint is the end of a message
publication. This allows us to attribute a duration to the publication
and therefore display a time graph state. However, we only have a single
tracepoint for client/service-related publication/take instances, so we
do not have any duration data. For now, just hardcode a 5000 ns duration
so that time graph states are visible enough.

Signed-off-by: Christophe Bedard <[email protected]>
@christophebedard
Copy link
Member Author

If anyone wants to see how this instrumentation can be used, see this PR that adds support for this new instrumentation in Eclipse Trace Compass (see the included screenshot): eclipse-tracecompass-incubator/org.eclipse.tracecompass.incubator#127.

@christophebedard
Copy link
Member Author

@mjcarroll would you be willing to review this?

@christophebedard
Copy link
Member Author

Or maybe @clalancette or @fujitatomoya?

@fujitatomoya
Copy link
Contributor

@christophebedard i think i can give it a shot.

@fujitatomoya
Copy link
Contributor

@christophebedard overall looks good to me, i got several comments though.

@christophebedard christophebedard force-pushed the christophebedard/instrument-services-end-to-end branch from 0d89ee1 to 394d7e0 Compare December 9, 2024 22:35
Copy link
Contributor

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with green CI.

@christophebedard
Copy link
Member Author

Pulls: #145, ros2/rmw_cyclonedds#521, ros2/rmw_fastrtps#787, ros2/rmw_connextdds#163
Gist: https://gist.githubusercontent.com/christophebedard/6de7664f9c4b0103faca7eff9ae7a757/raw/f3665ac1140f461336f4c2444f38bacfcc8f44d8/ros2.repos
BUILD args: --packages-above-and-dependencies test_tracetools tracetools tracetools_trace rmw_cyclonedds_cpp rmw_fastrtps_cpp rmw_fastrtps_dynamic_cpp rmw_fastrtps_shared_cpp rmw_connextdds_common
TEST args: --packages-above test_tracetools tracetools tracetools_trace rmw_cyclonedds_cpp rmw_fastrtps_cpp rmw_fastrtps_dynamic_cpp rmw_fastrtps_shared_cpp rmw_connextdds_common
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14947

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@christophebedard
Copy link
Member Author

christophebedard commented Dec 12, 2024

I had to manually trigger ci_launcher because CI_USE_FASTRTPS_DYNAMIC needs to be enabled, and Connext DDS is not available on ci_linux-aarch64:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@christophebedard
Copy link
Member Author

Alright, CI looks good and all PRs are approved. Merging all PRs.

@christophebedard christophebedard merged commit 699f572 into rolling Dec 14, 2024
8 of 9 checks passed
@christophebedard christophebedard deleted the christophebedard/instrument-services-end-to-end branch December 14, 2024 20:11
christophebedard added a commit to eclipse-tracecompass-incubator/org.eclipse.tracecompass.incubator that referenced this pull request Dec 14, 2024
This adds support for the new client/service (i.e., RPC) instrumentation
in ROS 2, see ros2/ros2_tracing#145.

1. In the objects analysis, create client and service objects
2. In the messages analysis, create the following instances:
    1. Request publication
    2. Request take and callback
    3. Response publication
    4. Response take
    5. Message transport for requests and responses
3. In the messages dataprovider, display the above instances

There is one limitation. Normal message publications and message takes
have instrumentation that provides a "start time." For example, for
message publications, the `ros2:rclcpp_publish` tracepoint is the start
and the `ros2:rmw_publish` tracepoint is the end of a message
publication. This allows us to attribute a duration to the publication
and therefore display a time graph state. However, we only have a single
tracepoint for client/service-related publication/take instances, so we
do not have any duration data. For now, just hardcode a 5000 ns duration
so that time graph states are visible enough.

Signed-off-by: Christophe Bedard <[email protected]>
@ros-discourse
Copy link

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-visualization-tools-for-architecture/41170/2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Instrument service and client for end-to-end request/response tracking
3 participants