DAOS-18834 cart: Add bulk, rpc and corpc metrics #18037
- Refactored how per-context metrics are stored: crt_internal_types.h now has a list of metrics and provides helper macros for manipulating them
- Added metrics for: rpc send, receive, completed, reply, failures; corpc initiate, complete, failures; bulk create, destroy, bind, failures
- Old metrics converted to new format

Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
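The "list of metrics plus helper macros" layout described above is commonly built with an X-macro. The following is only a minimal sketch of that pattern, not the code from crt_internal_types.h: every `DEMO_*` and `DM_*` name is hypothetical, and a plain `uint64_t` array stands in for whatever telemetry nodes the real implementation keeps per context.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical metric kinds, standing in for D_TM_COUNTER / D_TM_GAUGE. */
enum demo_type { DEMO_COUNTER, DEMO_GAUGE };

/* Single list of metrics: id, type, description, unit (names are made up). */
#define DEMO_METRICS(X)                                                          \
	X(DM_RPC_SENT, DEMO_COUNTER, "Total number of RPCs sent", "rpcs")        \
	X(DM_RPC_RECV, DEMO_COUNTER, "Total number of RPCs received", "rpcs")    \
	X(DM_RPC_WAITQ_DEPTH, DEMO_GAUGE, "Current count of enqueued RPCs", "rpcs")

/* First expansion: one enum value per metric, plus a trailing count. */
#define X_ENUM(id, type, desc, unit) id,
enum demo_metric_id { DEMO_METRICS(X_ENUM) DM_COUNT };
#undef X_ENUM

struct demo_metric_info {
	const char    *name;
	enum demo_type type;
	const char    *desc;
	const char    *unit;
};

/* Second expansion: a descriptor table indexed by the enum above. */
#define X_INFO(id, type, desc, unit) [id] = {#id, type, desc, unit},
static const struct demo_metric_info demo_info[DM_COUNT] = {DEMO_METRICS(X_INFO)};
#undef X_INFO

/* Per-context storage: one slot per metric in the list. */
struct demo_ctx {
	uint64_t values[DM_COUNT];
};

/* Helper macros so call sites never index the array directly. */
#define DEMO_METRIC_INC(ctx, id)    ((ctx)->values[(id)]++)
#define DEMO_METRIC_SET(ctx, id, v) ((ctx)->values[(id)] = (v))
#define DEMO_METRIC_GET(ctx, id)    ((ctx)->values[(id)])

/* Print every metric of a context using the descriptor table. */
void
demo_dump(const struct demo_ctx *ctx)
{
	assert(ctx != NULL);
	for (int i = 0; i < DM_COUNT; i++)
		printf("%s (%s) = %llu %s\n", demo_info[i].name, demo_info[i].desc,
		       (unsigned long long)ctx->values[i], demo_info[i].unit);
}
```

With this layout, adding a metric means adding one line to the list; the enum, the descriptor table and the per-context storage all grow with it.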
Ticket title is 'cart: add rpc, corpc and bulk counters'
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18037/2/execution/node/1032/log
- Update telemetry_utils.py with new metrics Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18037/5/execution/node/965/log
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
X(CM_RPCS_REPLY_FAILED, D_TM_COUNTER, "Total number of failed replies", "rpcs") \
X(CM_RPCS_SENT, D_TM_COUNTER, "Total number of RPCs sent", "rpcs") \
X(CM_RPCS_COMPLETED, D_TM_COUNTER, "Total number of RPCs completed successfully", "rpcs") \
X(CM_RPCS_COMPLETED_ERR, D_TM_COUNTER, "Total number of sent RPCs completed with error", \
If I follow the previous naming logic, shouldn't this be RPC_FWD_FAILED?
'FWD' in the naming here (line 473) refers to cases where we forward an RPC to another target (with bulk binding, etc.), not to calls to HG_Forward() :) This particular metric counts any RPCs that completed with an error.
OK, that's definitely confusing then, partly because you had it in the list before the reply. Maybe clarify what is being forwarded.
Rearranged metrics and added wording to the FWD metric.
X(CM_RPC_WAITQ_DEPTH, D_TM_GAUGE, "Current count of enqueued RPCs", "rpcs") \
X(CM_RPC_QUOTA_EXCEEDED, D_TM_COUNTER, "Total number of exceeded RPC quota events", \
"events") \
X(CM_RPCS_RECV, D_TM_COUNTER, "Total number of RPCs received", "rpcs") \
It would be better to keep the CM_RPC prefix without the 's', to avoid having CM_RPC both with and without the plural.
Agreed, will change to CM_RPC.
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
Test stage Build on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18037/8/execution/node/250/log
Test stage Build on Leap 15 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18037/8/execution/node/298/log
Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18037/8/execution/node/344/log
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18037/10/testReport/
Looked at both C and Go changes. LGTM.
Verified some of the new counters locally by running self_test and monitoring metrics: 14509 - Metric Set: engine_net_cm_bulk_free (Type: Counter)
This conflicts with #18161
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
98ab262
kjacque left a comment
Go changes still LGTM. :)
"engine_net_quota_exceeded",
"engine_net_glitch",
"engine_net_failed_addr",
"engine_net_req_timeout",
engine_net_req_timeout is still being used by this test, so the test needs to be updated with the new metric name. This PR should probably run with Features: telemetry to catch any other issues like this.
Good catch, will run with the Features: telemetry tag.
- Linting fixes Features: telemetry Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
eda228a
Features: telemetry Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>