Skip to content

Conversation

PatKamin
Copy link
Contributor

For better results stability, pin benchmark binaries to four cores with the maximum available frequency.

@PatKamin
Copy link
Contributor Author

Test run on the PVC perf machine: https://github.com/intel/llvm/actions/runs/18679758239/job/53258464661

@PatKamin PatKamin requested review from a team and vinser52 October 21, 2025 12:06
For better results stability, pin benchmark binaries to four cores with the maximum available frequency.
Copy link
Contributor

@lslusarczyk lslusarczyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some suggestions

core_frequencies.append((core, freq))
core_frequencies.sort(key=lambda x: x[1], reverse=True)
available_cores = [core for core, _ in core_frequencies[:4]]
cores_list = ",".join([str(core) for core in available_cores])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cores_list = ",".join([str(core) for core, _ in core_frequencies[:4]])
unless you think two lines are more readable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's cleaner, done.

freq = int(f.read().strip())
core_frequencies.append((core, freq))
core_frequencies.sort(key=lambda x: x[1], reverse=True)
available_cores = [core for core, _ in core_frequencies[:4]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does numer 4 comes from?

Shouldn't we take all cores that have frequency as the fastest one? + maybe not less than 4

available_cores=[cf[0] for idx, cf in enumerate(core_frequencies) if cf[1] == core_frequencies[0][1] or idx<3]

plus maybe warn if not all 4 first cores have the same frequency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having issues on one setup with the last P-core made me leave exactly 4 cores for benchmarks which seems to be enough for all single-threaded scenarios and for Compute Benchmarks' Memcpy multi-threaded scenarios where 4 threads are used.

I've triggered tests for llama scenarios where the benchmark binary runs 8 threads to see if limiting to 4 cores has any impact on the results: https://github.com/intel/llvm/actions/runs/18712800917.

All setups in CI satisfy the 4 cores with maximum frequency requirement, so no warning added.

run: |
# Compute the core range for the first NUMA node; second node is used by
# UMF. Skip the first 4 cores as the kernel is likely to schedule more
# UMF. Skip the first 3 cores as the kernel is likely to schedule more
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does number 3 come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On one of machines I've had issues with the last P-core - 7th. In order to guarantee 4 P-cores for benchmarks on all CI setups, I use also the core number 3. 4 cores left was a rule of thumb anyway.

) as f:
freq = int(f.read().strip())
core_frequencies.append((core, freq))
core_frequencies.sort(key=lambda x: x[1], reverse=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we discussed, sorting is not needed as we already assume that we have to take first 4 cores, not other ones because of this faulty 7th core on one of BMGs

Please remove sorting, add warning if selected cores differ in frequency between themselves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

)
}

def taskset_cmd(self) -> list[str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we will nest tasksets. One to run the python script, second to run the actual benchmark. I don't know how these interact, but it seems odd. I'd rather we do these sort of calculations in one place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I want to leave the maximum possible compute resources for building all the projects but limit available cores to the max frequency ones for benchmark scenarios runs. The difference in total build times is significant if we are to limit builds to just 4 cores.

However, I think that if we are to drop one of these tasksets, I would drop the "outer" one as we can use all the cores from a socket for builds, and move the logic of leaving first cores and picking up ones with the maximum frequency to python.

dataclasses-json==0.6.7
PyYAML==6.0.1
Mako==1.3.0
psutil>=7.0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? it was missing for some reason or is it a mistake?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding psutil.Process().cpu_affinity() usage. This is a third-party package, not a part of the standard library AFAIK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants