Skip to content

DAOS-18834 cart: Add bulk, rpc and corpc metrics#18037

Open
frostedcmos wants to merge 14 commits into
masterfrom
frostedcmos/DAOS-18834
Open

DAOS-18834 cart: Add bulk, rpc and corpc metrics#18037
frostedcmos wants to merge 14 commits into
masterfrom
frostedcmos/DAOS-18834

Conversation

@frostedcmos
Copy link
Copy Markdown
Contributor

  • Refactored how per-context metrics are stored: crt_internal_types.h now has a list of metrics and provides helper macros for manipulating them

  • Added metrics for: rpc send,receive,completed,reply,failures corpc initiate,complete,failures, bulk create,destroy,bind,failures

  • Old metrics converted to new format

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

- Refactored how per-context metrics are stored:
  crt_internal_types.h now has a list of metrics and provides
  helper macros for manipulating them

- Added metrics for:
	rpc send,receive,completed,reply,failures
	corpc initiate,complete,failures,
	bulk create,destroy,bind,failures
- Old metrics converted to new format

Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
@frostedcmos frostedcmos requested review from a team as code owners April 17, 2026 17:08
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 17, 2026

Ticket title is 'cart: add rpc, corpc and bulk counters'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-18834

@daosbuild3
Copy link
Copy Markdown
Collaborator

- Update telemetry_utils.py with new metrics

Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
@frostedcmos frostedcmos requested review from a team as code owners April 21, 2026 15:45
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
@frostedcmos frostedcmos marked this pull request as draft April 21, 2026 17:36
@daosbuild3
Copy link
Copy Markdown
Collaborator

Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
@frostedcmos frostedcmos requested a review from soumagne April 22, 2026 21:38
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
Comment thread src/cart/crt_context.c
Comment thread src/cart/crt_internal_types.h
Comment thread src/cart/crt_internal_types.h Outdated
X(CM_RPCS_REPLY_FAILED, D_TM_COUNTER, "Total number of failed replies", "rpcs") \
X(CM_RPCS_SENT, D_TM_COUNTER, "Total number of RPCs sent", "rpcs") \
X(CM_RPCS_COMPLETED, D_TM_COUNTER, "Total number of RPCs completed successfully", "rpcs") \
X(CM_RPCS_COMPLETED_ERR, D_TM_COUNTER, "Total number of sent RPCs completed with error", \
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I follow the previous naming logic, it would be RPC_FWD_FAILED ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'FWD' in the naming here (line 473) is for cases when we forward rpc to another target (with bulk binding etc), not for when we call HG_Forward() :) This particular metric is for any rpcs that completed with an error.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok well that's definitely confusing then, well also because you had that in the list before the reply, maybe clarify what is forwarded then.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rearranged metircs and added wording to FWD metric

Comment thread src/cart/crt_internal_types.h Outdated
X(CM_RPC_WAITQ_DEPTH, D_TM_GAUGE, "Current count of enqueued RPCs", "rpcs") \
X(CM_RPC_QUOTA_EXCEEDED, D_TM_COUNTER, "Total number of exceeded RPC quota events", \
"events") \
X(CM_RPCS_RECV, D_TM_COUNTER, "Total number of RPCs received", "rpcs") \
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be better to keep CM_RPC prefix without the s to avoid having CM_RPC with and without the plural

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, will change to CM_RPC

Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
@daosbuild3
Copy link
Copy Markdown
Collaborator

@daosbuild3
Copy link
Copy Markdown
Collaborator

@daosbuild3
Copy link
Copy Markdown
Collaborator

Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
@daosbuild3
Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18037/10/testReport/

tanabarr
tanabarr previously approved these changes Apr 24, 2026
Copy link
Copy Markdown
Contributor

@tanabarr tanabarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go changes LGTM

@frostedcmos frostedcmos marked this pull request as ready for review April 24, 2026 16:34
@frostedcmos frostedcmos requested review from liw and soumagne April 24, 2026 16:34
@frostedcmos frostedcmos requested a review from jgmoore-or April 24, 2026 17:12
kjacque
kjacque previously approved these changes Apr 24, 2026
@kjacque
Copy link
Copy Markdown
Contributor

kjacque commented Apr 24, 2026

Looked at both C and Go changes. LGTM.

Comment thread src/cart/crt_internal_types.h
soumagne
soumagne previously approved these changes May 1, 2026
@frostedcmos
Copy link
Copy Markdown
Contributor Author

Verified some of new counters locally by running self_test and monitoring metrics:
`14475 - Metric Set: engine_net_cm_bulk_create (Type: Counter)
14476 Total number of bulks created
14477 Metric Labels Value
14478 ------ ------ -----
..
14483 Counter (context=2, provider=ofi+tcp, rank=0) 2

14509 - Metric Set: engine_net_cm_bulk_free (Type: Counter)
14510 Total number of bulks that were freed
14511 Metric Labels Value
14512 ------ ------ -----
...
14517 Counter (context=2, provider=ofi+tcp, rank=0) 2

`

@frostedcmos frostedcmos requested a review from a team May 12, 2026 14:50
@daltonbohning
Copy link
Copy Markdown
Contributor

This conflicts with #18161

@daltonbohning daltonbohning removed the request for review from a team May 12, 2026 16:01
Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
@frostedcmos frostedcmos dismissed stale reviews from soumagne, kjacque, and tanabarr via 98ab262 May 12, 2026 16:36
kjacque
kjacque previously approved these changes May 12, 2026
Copy link
Copy Markdown
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go changes still LGTM. :)

soumagne
soumagne previously approved these changes May 12, 2026
"engine_net_quota_exceeded",
"engine_net_glitch",
"engine_net_failed_addr",
"engine_net_req_timeout",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

engine_net_req_timeout is still being used by this test so the test needs to be updated with the new metric name. This PR should probably run with Features: telemetry to catch any other issues like this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will do with feature

tanabarr
tanabarr previously approved these changes May 12, 2026
Copy link
Copy Markdown
Contributor

@tanabarr tanabarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go changes LGTM

- Linting fixes

Features: telemetry

Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
@frostedcmos frostedcmos dismissed stale reviews from tanabarr, soumagne, and kjacque via eda228a May 12, 2026 20:25
Features: telemetry

Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

6 participants