-
Notifications
You must be signed in to change notification settings - Fork 790
[Benchmarks] Pin benchmarks to small set of cores #20403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
Conversation
10804b3 to
8855934
Compare
|
Test run on the PVC perf machine: https://github.com/intel/llvm/actions/runs/18679758239/job/53258464661 |
For better results stability, pin benchmark binaries to four cores with the maximum available frequency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some suggestions
| core_frequencies.append((core, freq)) | ||
| core_frequencies.sort(key=lambda x: x[1], reverse=True) | ||
| available_cores = [core for core, _ in core_frequencies[:4]] | ||
| cores_list = ",".join([str(core) for core in available_cores]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cores_list = ",".join([str(core) for core, _ in core_frequencies[:4]])
unless you think two lines are more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's cleaner, done.
| freq = int(f.read().strip()) | ||
| core_frequencies.append((core, freq)) | ||
| core_frequencies.sort(key=lambda x: x[1], reverse=True) | ||
| available_cores = [core for core, _ in core_frequencies[:4]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does numer 4 comes from?
Shouldn't we take all cores that have frequency as the fastest one? + maybe not less than 4
available_cores=[cf[0] for idx, cf in enumerate(core_frequencies) if cf[1] == core_frequencies[0][1] or idx<3]
plus maybe warn if not all 4 first cores have the same frequency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having issues on one setup with the last P-core made me leave exactly 4 cores for benchmarks which seems to be enough for all single-threaded scenarios and for Compute Benchmarks' Memcpy multi-threaded scenarios where 4 threads are used.
I've triggered tests for llama scenarios where the benchmark binary runs 8 threads to see if limiting to 4 cores has any impact on the results: https://github.com/intel/llvm/actions/runs/18712800917.
All setups in CI satisfy the 4 cores with maximum frequency requirement, so no warning added.
| run: | | ||
| # Compute the core range for the first NUMA node; second node is used by | ||
| # UMF. Skip the first 4 cores as the kernel is likely to schedule more | ||
| # UMF. Skip the first 3 cores as the kernel is likely to schedule more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does number 3 come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On one of machines I've had issues with the last P-core - 7th. In order to guarantee 4 P-cores for benchmarks on all CI setups, I use also the core number 3. 4 cores left was a rule of thumb anyway.
8855934 to
e5b651a
Compare
| ) as f: | ||
| freq = int(f.read().strip()) | ||
| core_frequencies.append((core, freq)) | ||
| core_frequencies.sort(key=lambda x: x[1], reverse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as we discussed, sorting is not needed as we already assume that we have to take first 4 cores, not other ones because of this faulty 7th core on one of BMGs
Please remove sorting, add warning if selected cores differ in frequency between themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
e5b651a to
670adfc
Compare
| ) | ||
| } | ||
|
|
||
| def taskset_cmd(self) -> list[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we will nest tasksets. One to run the python script, second to run the actual benchmark. I don't know how these interact, but it seems odd. I'd rather we do these sort of calculations in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I want to leave the maximum possible compute resources for building all the projects but limit available cores to the max frequency ones for benchmark scenarios runs. The difference in total build times is significant if we are to limit builds to just 4 cores.
However, I think that if we are to drop one of these tasksets, I would drop the "outer" one as we can use all the cores from a socket for builds, and move the logic of leaving first cores and picking up ones with the maximum frequency to python.
| dataclasses-json==0.6.7 | ||
| PyYAML==6.0.1 | ||
| Mako==1.3.0 | ||
| psutil>=7.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? it was missing for some reason or is it a mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm adding psutil.Process().cpu_affinity() usage. This is a third-party package, not a part of the standard library AFAIK.
For better results stability, pin benchmark binaries to four cores with the maximum available frequency.