Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add topology option #354

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

xrmzju
Copy link

@xrmzju xrmzju commented Dec 7, 2023

  1. add topology option: DisableNodeCaches, DisableNodeAreas, DisableNodeDistances when collect topology
  2. pop err to ctx when collect topologyNodes so that the caller can know it

@jaypipes
Copy link
Owner

jaypipes commented Dec 9, 2023

@xrmzju Hi! Thanks for your submission! Can you tell me a little more about why you want to disable collection of NUMA topology information?

@jaypipes jaypipes requested review from ffromani and jaypipes December 9, 2023 16:36
Copy link
Collaborator

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also love to know about the usecase. I can guess in some scenario the collection of topology data fails? If my guess is right I'd rather fix this usecase.

Comment on lines +28 to +30
DisableNodeCaches bool
DisableNodeAreas bool
DisableNodeDistances bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this (kinda unprecedented) level of detail or can we just have a single option controlling everything?

Comment on lines +119 to +121
DisableNodeCaches: option.EnvOrDefaultDisableNodeCaches(),
DisableNodeAreas: option.EnvOrDefaultDisableNodeAreas(),
DisableNodeDistances: option.EnvOrDefaultDisableNodeDistances(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to say to keep consistent with other options are handled even though we should later perhaps refactor. But I'd be consistent meantime.

@@ -46,32 +46,37 @@ func topologyNodes(ctx *context.Context) []*Node {
nodeID, err := strconv.Atoi(filename[4:])
if err != nil {
ctx.Warn("failed to determine node ID: %s\n", err)
ctx.Err = err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this usecase makes sense, but I'd rather add a new function or enhance the Warn helper to also set err, without exposing the field (just yet)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry i did not get your point, could you give a example code please?

@xrmzju
Copy link
Author

xrmzju commented Dec 11, 2023

some scenario the collection of topology data fails

Yes, your guess is correct. There was a failure in collecting memory area for a certain collection of topology data, but no error was raised. As a result, my program continued to run with incorrect topology information. In my scenario, I only require the CPU NUMA topology information and I do not need to consider the NodeCaches, NodeAreas, and NodeDistances. so i made the modification above

@ffromani
Copy link
Collaborator

ffromani commented Dec 12, 2023

some scenario the collection of topology data fails

Yes, your guess is correct. There was a failure in collecting memory area for a certain collection of topology data, but no error was raised. As a result, my program continued to run with incorrect topology information. In my scenario, I only require the CPU NUMA topology information and I do not need to consider the NodeCaches, NodeAreas, and NodeDistances. so i made the modification above

Thanks for clarifying. Could you please share a description of the hardware on which the collection fails? E.g was it a regular NUMA x86 machine? Perhaps you were using (relatively) new technology like CXL? Or was it arm?

In general I'm reluctant to add so fine control about collection of information - adds too many knobs and makes the code less regular, so I'd like to learn more about the usecase.

@xrmzju
Copy link
Author

xrmzju commented Dec 12, 2023

some scenario the collection of topology data fails

Yes, your guess is correct. There was a failure in collecting memory area for a certain collection of topology data, but no error was raised. As a result, my program continued to run with incorrect topology information. In my scenario, I only require the CPU NUMA topology information and I do not need to consider the NodeCaches, NodeAreas, and NodeDistances. so i made the modification above

Thanks for clarifying. Could you please share a description of the hardware on which the collection fails? E.g was it a regular NUMA x86 machine? Perhaps you were using (relatively) new technology like CXL? Or was it arm?

In general I'm reluctant to add so fine control about collection of information - adds too many knobs and makes the code less regular, so I'd like to learn more about the usecase.

image Here is the failure message. The operating system we are using has been specifically designed by our internal team, which means there might be some bugs or unique features. However, in my current scenario, I do not require any memory area information. Therefore, I have been attempting to find a solution to bypass or skip it. I feel free to add less fine control about collection of information, maybe some option like `CPUTopologyOnly` which will disable collecting `NodeCaches` and `NodeAreas`?

@ffromani
Copy link
Collaborator

@xrmzju thanks for sharing. I'll try to think about a more generic solution. I'll get back ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants