post to lkml #1
The linux-kernel mailing list: http://www.tux.org/lkml/
/cc @rikvanriel
The paper this refers to is: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
Presentation of the paper is Tuesday 19th April 2016, 9:30-10:30 CET: http://eurosys16.doc.ic.ac.uk/program/program/ I suspect any further news will come after the EuroSys 2016 conference, plus some time to catch up on sleep ;)
I think it would be wise for the authors to discuss these findings before the conference, rather than after.
If you wanna work on Linux, it would be a nice move to also play by the rules, and that means doing as people here suggested. Old grumps like me will remember that the SCO Linux emulation used to outperform Linux due to SMP scalability issues, and that this got mostly ignored, driven by Linus' "we can tune later" stance, which hurt Linux for years. Your paper comes very close to this, erm, sensitive area, and your patches were made without involving the subsystem maintainers, as far as I can tell. Normal people would probably just go "oh wow, you really made it that fast?" Do I think Linux's mode of operation via LKML and head-in-sand is silly? Yes. I can understand you'd rather sleep than deal with the answers that will come in once you post this. I'd like to see those patches go in, but a) that's unlikely if you piss off a large not-so-open group, and b) I also wouldn't like to see you come home from the presentation you're looking forward to only to be torn apart on LKML in the aftermath. So better to spend 10 minutes now than days later :-)
It's a shame what LKML can be like. It would be nice if they bundled it in as experimental. Prior to this research, I noticed significant performance degradation on a quad-core, with two applications using a core each. I've just compiled a kernel with these patches and am looking forward to trying them out.
I like to think that the community is willing to listen positively to what they have to show.
There is some discussion now on LKML: https://lkml.org/lkml/2016/4/23/135
Well, it seems that I cannot reply to the thread on LKML (maybe because I'm not subscribed to the list?). Anyway, here is our answer to the existing LKML thread if anyone is interested, with better instructions to reproduce the bugs. Out of the 4 bugs, 3 should happen on most NUMA boxes (2+ nodes, nothing fancy). Before anything else: we believe that the bugs can create serious performance issues (especially on big machines), but, as mentioned by Peter, the patches we provide are not intended to be merged into the kernel. They are just hacks that we used to benchmark a "better" behavior of the scheduler for our workload, to get an idea of the performance gains.

The Missing Scheduling Domains bug

echo 0 > /sys/devices/system/cpu/cpu1/online

After disabling (and re-enabling) a core this way, the load is no longer balanced between NUMA nodes. This is due to the fact that the NUMA-related sched_domains "disappear" when cpus are switched off and on. You can easily see the bug using any multi-threaded app after disabling/re-enabling a core. The bug was introduced in 3.19 and still exists in 4.6-rc5. It impacts all NUMA machines if you disable cores (granted, not a lot of people do that).

The Group Imbalance bug

We think that part of this bug comes from the way "load" is computed when using autogroups. Prior to 4.3 (not sure how it has evolved since then), the load of a task is divided by the sum of the load of all tasks in the autogroup. Let's take 2 applications, on a 64-core machine:

Suppose the thread of autogroup 1 runs on node 0 of the machine. To reproduce the bug, you can launch applications with different thread counts from multiple ssh connections. We expect 1 thread per core; in reality, core 0 is busy, but the other cores of node 0 are mostly idle. I have seen a few messages on LKML complaining about autogroups, so I guess it impacts more machines than our own. This is also more serious than the previous bug, as autogroups are now active by default even on large machines. The bug was introduced with autogroups (2.6.38) and is still there in 4.6-rc5.

The Scheduling Group Construction bug

This bug arises from the weird topology of our machine (and I guess it is the only bug directly related to the topology). Basically, a core:

Let's say core 0 is in all sched_groups of the last sched_domain of core 8, and that core 0 is in no sched_group of any other sched_domain of core 8. The bug was introduced by cb83b629bae0327cf9f44f096adc38d150ceb913 and is still there in 4.6-rc5. You can have a look at the topology of the machine in the paper to better understand what's going on. Peter, I also sent you an email about this bug 2 years ago with extra details if you need more info.

Overload on wakeup

I am not sure if this issue will really be considered a "bug" here, but here is what happens: part of the problem is that, when waking up, a thread only chooses a core on the same NUMA node as the one on which it began to sleep. On workloads that often sleep and wake up due to barriers, this behavior is suboptimal if other cores in the machine are idle. What happens precisely in our case is described on slide 25 of http://i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf It is probably easy to reproduce the problem with microbenchmarks (the slides present what happens on "a commercial database" whose name starts with O ;) ), but I don't have a precise one in mind. Hopefully the problem is obvious enough to be understood without a precise command to reproduce it.

We'll be following this thread if you need more information.
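For reference, a rough shell sketch of how the missing-scheduling-domains bug described in the previous comment could be reproduced and observed. This is only a sketch under assumptions: root on a 2+ node NUMA box running an affected kernel (3.19 through 4.6-rc5); cpu1, the 16-thread busy load, and mpstat (from the sysstat package) are illustration choices, not part of the original report.

```sh
# Toggle a core off and back on; on affected kernels this makes the
# NUMA-level sched_domains disappear.
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

# Start a simple multi-threaded CPU hog (16 busy loops, an arbitrary count).
for i in $(seq 16); do
    yes > /dev/null &
done

# Watch per-CPU utilization; on an affected kernel one NUMA node stays
# loaded while the other node(s) sit mostly idle.
mpstat -P ALL 2 5

# Clean up the busy loops afterwards.
kill $(jobs -p)
```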
@BLepers LKML is an open list, no subscription needed. Just avoid HTML. It also uses graylisting, so maybe your SMTP server has a problem handling that. Please send your reply to the list; it will be ignored if it stays here. From vger.kernel.org: all email sent there must be TEXT/PLAIN; there can be no multipart messages, no VCARDs, nothing "fancy". In the presence of such things, Majordomo will very likely do the wrong thing. When you send email there, do make sure that all of the email headers, both visible and transport level, have the same addresses in them; people experience problems when, for example, they do not. You can test email delivery between you and VGER by sending an empty test letter to: [email protected]
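For completeness, one way to get a guaranteed TEXT/PLAIN reply onto the list is to send it with mutt in batch mode. This is only a sketch under assumptions: mutt and a working mail transport are installed, reply.txt is the plain-text answer, and the In-Reply-To value is a placeholder for the Message-ID of the LKML thread.

```sh
# Send reply.txt as a plain-text message to LKML, threaded into an
# existing discussion. The Message-ID below is a placeholder.
mutt -e 'set content_type=text/plain' \
     -e 'my_hdr In-Reply-To: <placeholder-message-id@example.org>' \
     -s 'Re: The Linux Scheduler: a Decade of Wasted Cores' \
     linux-kernel@vger.kernel.org < reply.txt
```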
Hi, I just came from https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/ This is great work. Have these findings been merged into the mainline Linux kernel? Maybe with a merged, validated-by-maintainers final patch this paper would really shine. :)
@ryoqun I tested their patches but their findings were not reproducible on the systems at Netflix, with some synthetic workloads. I believe that is because we are running mostly 1- and 2-node NUMA, as is usually the case with cloud guests. I don't think these findings have been merged, nor discussed properly on lkml; it's only been discussed lightly[1]. I think the paper should have been discussed on lkml before publication, which is why I created this issue.
I think that I read somewhere that some were merged, but not all of them. Changes like this should be merged as fast as possible.
Patch 1

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5942,6 +5942,8 @@ struct sg_lb_stats {
unsigned int sum_nr_running; /* Nr tasks running in the group */
unsigned int idle_cpus;
unsigned int group_weight;
+ unsigned long min_load;
+ unsigned long max_load;
enum group_type group_type;
int group_no_capacity;

Here's the latest kernel:

unsigned int sum_nr_running; /* Nr tasks running in the group */
unsigned int idle_cpus;
unsigned int group_weight;
enum group_type group_type;
int group_no_capacity;

Doesn't look like it was integrated, unless it was rewritten not to have those lines. What do you think? You can dig further to confirm.

Patch 2

@@ -6130,7 +6130,7 @@ static void claim_allocations(int cpu, s
static int sched_domains_numa_levels;
enum numa_topology_type sched_numa_topology_type;
static int *sched_domains_numa_distance;
-int sched_max_numa_distance;
+int sched_max_numa_distance = -1;

The latest kernel:

int sched_max_numa_distance;

Not that bit.

Patch 3

--- linux-4.1.vanilla/kernel/sched/fair.c 2015-06-21 22:05:43.000000000 -0700
+++ linux-4.1.overload-on-wakeup/kernel/sched/fair.c 2015-11-05 01:30:19.693493606 -0800
@@ -4834,10 +4834,39 @@ select_task_rq_fair(struct task_struct *
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
+ int _cpu;
+ u64 oldest_idle_stamp = 0xfffffffffffffff;
+ int oldest_idle_stamp_cpu;
+
if (sd_flag & SD_BALANCE_WAKE)

The latest kernel:

Not that bit.

Patch 4

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5867,62 +5867,67 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
int i;
+ int tries;
cpumask_clear(covered);
- for_each_cpu(i, span) {

The latest kernel:

struct cpumask *covered = sched_domains_tmpmask;
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
int i;
cpumask_clear(covered);
for_each_cpu(i, span) {

Not that bit.

@AXDOOMER can you please remember where you read that these were integrated and let us know? Unless these patches have been substantially rewritten to be beyond recognizable, I'd conclude that they have not been integrated, and if anyone is saying otherwise it is misinformation and we should correct it.
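One quick, if rough, way to back up the "not integrated" conclusion is to grep a current kernel checkout for identifiers that only these patches introduce. A sketch, assuming a git clone of the kernel sources as the current directory; common names like min_load/max_load also appear elsewhere in the scheduler, so the unusual names are the more reliable signal.

```sh
# Search a kernel tree for identifiers specific to the wastedcores patches.
git grep -n "oldest_idle_stamp" kernel/sched/              # overload-on-wakeup patch
git grep -n "sched_max_numa_distance = -1" kernel/sched/   # patch 2
# No hits suggests the patches were not merged as-is (they could still
# have been rewritten beyond recognition, as noted above).
```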
@brendangregg @AXDOOMER thanks for the quick replies! I just hope this gets merged so the paper's effort isn't wasted and finally benefits all of us, maximizing the advantages of the open-source nature of the Linux kernel. :)
@ryoqun do you think you are hurt by these bugs? How many NUMA nodes does your system have? It's believed to only affect systems with high numbers of NUMA nodes. Try running numastat:

# numastat
                          node0
numa_hit             7302760990
numa_miss                     0
numa_foreign                  0
interleave_hit            29838
local_node           7302760990
other_node                    0

That's a typical system for us. Only one column (node0), so one node. All our systems are either one- or two-node NUMA, and these bugs aren't expected to affect us. The paper was studying a system with 8 NUMA nodes.
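For anyone unsure how many NUMA nodes their own box has, a few quick checks (a sketch; numactl may need to be installed, the others ship with most distros):

```sh
lscpu | grep -i numa                   # "NUMA node(s):" plus per-node CPU lists
numactl --hardware                     # nodes, their CPUs, and memory sizes
ls -d /sys/devices/system/node/node*   # one directory per online node
```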
BTW, this still needs to be posted to lkml. Let me make this abundantly clear: nothing will be fixed until this is posted properly to lkml. Somebody must do this, and the most obvious people to do so are the authors of the paper. I would volunteer if these bugs hurt the systems at Netflix, but they do not. I might volunteer just "for the good of Linux", but I have other work commitments first.
@brendangregg In fact, what I think I read is that patches were written to fix some of the issues, so not every issue has a fix. I have read the whole research paper. I don't know if the kernel engineers are aware of "The Linux Scheduler: a Decade of Wasted Cores"; their help would be great.
You get kernel engineering help by posting to lkml and engaging them.
While I can understand why such a patch may not affect Netflix noticeably, I can tell you under which circumstances this patch shines. In regards to game hosting: if we had, say, 2 instances of a server, each consisting of two major threads, they would in the past be scheduled on the same core (presumably to save power/reduce heat). When the server's load started to spike, there'd be a massive hitch as it rescheduled the process. Just because the process hovered around 70-80% CPU, it would keep other processes on the same core. Every time the load spiked to 100% it would hitch horribly. After applying this patch and disabling the hyper-threaded cores (which don't work too well with the patch), we noticed the load was spread to the least-used cores from the beginning. In conclusion, while we got less overall performance (likely due to disabling HT), we got considerably more performance by being able to load up the machine with more processes without them noticeably affecting each other's performance.
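A small sketch of how the placement behaviour described above can be observed directly: watch which core (PSR column) each thread of the server lands on over time. The process name game_server is a placeholder; only standard procps tools are used.

```sh
# Refresh thread-to-CPU placement every second for all matching processes.
watch -n 1 'ps -L -o pid,tid,psr,pcpu,comm -p "$(pgrep -d, game_server)"'
```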
@Turbine1991 Did you try without HT and without the patches?
@brendangregg That was the second thing I tried; it only made things spikier with multiple processes. I'd just about given up before this patch landed.
@Turbine1991 Are you interested in raising the patches on lkml?
These patches are for Linux 4.1. Will they work with the latest kernel?
@brendangregg I don't have the experience to do such a thing. @AXDOOMER Yep. Feel free to use the script on my GitHub for easily compiling the latest kernel on Ubuntu with this patch applied.
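Regarding the "will they work with the latest kernel" question above: a quick way to find out, before building anything, is a dry run of each patch against the newer tree. A sketch, assuming a kernel source checkout as the current directory; the patch file names are placeholders for the files in this repo.

```sh
for p in missing_domains.patch group_imbalance.patch \
         group_construction.patch overload_on_wakeup.patch; do
    echo "== $p"
    # --dry-run reports rejected hunks without touching the tree.
    patch -p1 --dry-run < "$p" || echo "   -> needs rebasing for this kernel"
done
```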
@brendangregg Well, I'm not directly affected by these bugs. I just found this paper online by chance, got interested, and wanted to know whether these patches were really merged into the mainline kernel or not.
I have built the modified kernel (4.8.14) with the patches successfully using @Turbine1991's script from his repo. The patches included in this repo are not up to date: the group imbalance patch will fail (2 chunks out of 7) and the overload-on-wakeup patch will also fail. So I'm running the patched kernel in my VM under PointLinux 3.2. I gave it 4 cores and built the Linux kernel 4.8.14 (vanilla) to benchmark. Built on the patched kernel:
Built on the vanilla kernel:
So it's a lot faster. I'm impressed! I couldn't believe it at first, so I did it twice to confirm, and it was true. I got these results using "time make -j32", because you want more busy threads than you have cores if you want to check whether the kernel balances them in the most optimized way.
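For anyone wanting to repeat this, a sketch of the benchmark as described (a kernel source tree is assumed; -j32 deliberately oversubscribes a 4-core machine so the scheduler has more runnable threads than cores):

```sh
make defconfig      # any fixed config works, as long as both runs use the same one
make clean
time make -j32      # record the "real" time
# Reboot into the patched kernel, repeat the same three commands,
# and compare the two "real" times.
```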
@AXDOOMER the place to discuss this result is lkml, thanks. Let them know you found a 6x win.
@brendangregg I don't have enough time to dedicate to arguing with people on the lkml. I'll have more time in two weeks, and I'd like to have more benchmarks. The benefits of these fixes only seem to appear when the system is under heavy load. I thought you had already started talking with them: https://lkml.org/lkml/2016/4/23/194
I posted this on the lkml:
Could you please give me a few command lines to recreate the environment for building that package? I figure running this on a bare-metal machine would give it more authenticity, as a VM could potentially skew the results. Running Linux in a VM under Windows, for example, wouldn't be a good test of real-world performance.
I don't have a machine on which I can do this right now; that's why I would like other people to test. I don't think the VM helped the patched kernel get faster in any way. Have you tried the patched kernel on your own computer to see if it is faster? I don't see a performance change in normal usage, but when I fully use the CPU, it's much faster. Note to myself: check if it's faster when putting a heavy load on a single core.
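A sketch for the "heavy load on a single core" note above: pin a busy loop to one core, rerun the build, and compare times (core 0 and -j8 are arbitrary choices, not from the original comment).

```sh
taskset -c 0 yes > /dev/null &   # CPU hog pinned to core 0
HOG=$!
time make -j8                    # build while one core is saturated
kill "$HOG"
```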
Nobody on the lkml probably cares about this because CFS is being replaced: http://www.phoronix.com/scan.php?page=news_item&px=MuQSS-v0.15-Linux-4.9 BFS is another scheduler and it already performs better than CFS. MuQSS is an improvement on BFS.
BFS/MuQSS is unlikely to ever be mainlined, due to reluctance from its primary developer. In response to the benchmark, I ran unixbench and got a lower score with the patchset, due to over-scheduling. I haven't directly compared them with hyper-threading disabled on both. But really, the beauty of this patch is how it schedules things for per-application performance rather than overall performance, although I feel it'd do better when operating properly. The only condition under which I've felt improved performance was when I had a game server either:
Both of these scheduled tasks on the same core frequently; when usage spiked, the performance hit was felt.
@AXDOOMER thanks for posting to lkml. Unfortunately the merge window just opened (see Linus's email from earlier: https://lkml.org/lkml/2016/12/11/102), and maintainers are usually at their busiest during this time, working through the backlog of patchsets from the last couple of months. The merge window ends in 2 weeks and things go back to normal, but maybe they'll reply before then. If neither of the scheduler maintainers responds (Peter or Ingo), I can follow up on lkml myself.
Just so we're clear, someone would practically have to create a new scheduler to merge the changes. In its current state, it's mere prototype code which does not care about battery life & thermal requirements, and it also has minor bugs for certain workloads. Heck, we don't even know how it stacks up against bfq.
The problems pointed out by the Wasted Cores paper certainly are real. The issue is that the patches introduce some problems of their own, and some of the problems identified in the paper may need to be fixed in a different way. The NUMA group construction fix seems like the most obvious one to get merged. The rest have some issues of their own. For example, always searching for an idle core anywhere in the system, without taking locality into account, creates regressions with some workloads. We still want to search for idle cores more aggressively, but we also need to take locality into account somehow.
@rikvanriel how widespread do you think they are? Most of our systems are 1-node NUMA, with some 2-node only recently, as I expect is the case for the rest of EC2 if not all of the cloud. But I suspect you have exposure to different environments.
@brendangregg the point is not how common those larger systems are, but that we cannot just intentionally break them in order to make smaller systems work better for some workloads. We need to find solutions (to the problems identified in this paper) that work for all sizes of systems.
@rikvanriel fixing the bugs is one thing, but the paper has claimed that the Linux scheduler is fundamentally broken. I'm interested in understanding both: A) the fixes introduced, and B) whether the Linux scheduler really has been fundamentally broken, with a decade of wasted cores. If (B) were true, my company would have lost millions due to our choice of Linux on the cloud. It's an incredible claim, and not one that can be ignored.
@brendangregg these bugs show up with certain workloads, on certain types of machines. For many workloads the amount of CPU time wasted will be small, and for some workloads a little bit of wasted CPU time is beneficial compared to the downside of running far away from the data (think of two tasks communicating through shared memory, sockets, or pipes). Every workload is different in the locality vs. work-conserving trade-off. For some workloads, it is faster to run immediately. For other workloads, it would be faster to run closer to the data, even if it means waiting a short amount of time to run. NUMA memory access penalties typically range from 20% to 100%, depending on the type and size of the system. Trading away locality for running immediately is not free: wasting 5% of CPU time but then running 10-20% faster due to better memory locality is a performance benefit, not a performance penalty. The bugs pointed out by this paper are real, but some of the proposed fixes are medicines with side effects worse than the disease they treat. That is why I say we may need to fix these issues in different ways.
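The locality penalty mentioned above can be made visible on a 2+ node box by forcing a memory-bound workload to use local vs. remote memory. A sketch, assuming numactl and sysbench are installed (sysbench is just a convenient stand-in for a memory-bound task, not something from the original comment):

```sh
numactl --cpunodebind=0 --membind=0 sysbench memory run   # CPUs and memory on node 0
numactl --cpunodebind=0 --membind=1 sysbench memory run   # CPUs on node 0, memory on node 1
# The throughput gap between the two runs is the locality penalty being
# traded against "run immediately on any idle core".
```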
For anyone interested in trying the new MuQSS scheduler (formerly BFS) on Ubuntu, you may now use the following script to specify cfs/wastedcores/MuQSS; of course, there must be a patch available for the kernel version. I'd recommend 4.8 (stable patch) and 4.9 (unstable patch) for testing. https://github.com/Turbine1991/build_ubuntu_kernel_wastedcores
My latest benchmarks. @Turbine1991 I ran your script (time sudo ./build.sh) natively on an i7-4790 3.60GHz CPU without disabling HT. They may not be totally accurate, but there is no drastic improvement anymore.
@AXDOOMER @brendangregg those numbers show nicely why just blindly applying the patches from the "Wasted Cores" paper is not the best idea. The problems identified by the Wasted Cores researchers are real, but the proposed fixes have their own regressions in different areas. So far nobody has figured out how to fix all the identified issues without causing regressions elsewhere. This might be a good master's or PhD project for someone with a bunch of spare time, who would like to get hired by a company that depends on good Linux performance right after they have obtained their title :)
@rikvanriel Sure, blindly applying patches is not a good idea, but in some cases, such as when I applied them to my distro in my VM, it made a huge difference. That's why I'm not asking the kernel engineers to apply the patches; I just want them to be aware of the bugs that exist in the current scheduler. People may still use the patches if they make a difference for them. @Turbine1991 has a nice script that makes it easy to try every one of them.
I wonder how different these scores would be with HT disabled. Also, make sure your CPU isn't throttling; I suggest installing and using i7z. The 4790k runs really hot.
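A few ways to check for frequency throttling while benchmarking, besides i7z (a sketch; sensors comes from lm-sensors and turbostat from the linux-tools package and needs root):

```sh
watch -n 1 'grep "cpu MHz" /proc/cpuinfo'   # per-core clocks under load
sensors                                     # package/core temperatures
sudo turbostat --interval 5                 # per-core frequency, C-states, temps
```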