
post to lkml #1

Open
brendangregg opened this issue Apr 15, 2016 · 47 comments

@brendangregg

No description provided.

@webysther

The linux-kernel mailing list: http://www.tux.org/lkml/

@jeremyeder

/cc @rikvanriel

@brendangregg
Author

The paper this refers to is: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

@HenkPoley

Presentation of the paper is Tuesday 19th April 2016 9:30-10:30 CET: http://eurosys16.doc.ic.ac.uk/program/program/

I suspect you won't hear any further news until after the EuroSys 2016 conference, plus some time for the authors to catch up on sleep ;)

@brendangregg
Author

I think it would be wise for the authors to discuss these findings before the conference, rather than after.

@FlorianHeigl

FlorianHeigl commented Apr 19, 2016

If you wanna work on Linux, it would be a nice move to also play by the rules, and that means doing what people here have suggested.

Old grumps like me will remember that SCO's Linux emulation used to outperform native Linux due to SMP scalability issues, and that this was mostly ignored thanks to Linus' "we can tune later" stance, which hurt Linux for years. Your paper comes very close to this, erm, sensitive area, and as far as I can tell your patches were made without involving the subsystem maintainers.

Normal people would probably just go "oh wow, you really made it that fast?"
On LKML you might find a different reaction.

Do I think Linux' mode of operation via LKML and head-in-sand is silly? Yes.
Does it make sense not to use the process that 1000s of devs agreed on? Hell no.

I can understand that you'd rather sleep than deal with the replies that will come in once you post this.
So maybe just write a short notice that you'll be presenting this, and make clear you want to join any discussion afterwards.

I'd like to see those patches go in, but a) that's unlikely if you piss off a large, not-so-open group, and b) I'd hate to see you come home from the presentation you're looking forward to and be torn apart on LKML in the aftermath.

So better to spend 10 minutes now than days later :-)

@Turbine1991

Turbine1991 commented Apr 20, 2016

It's a shame what LKML can be like. It would be nice if they bundled it in as experimental. Prior to this research, I noticed significant performance degradation on a quad-core with two applications using a core each. I've just compiled a kernel with these patches and am looking forward to trying them out.

@webysther

I like to think that the community is willing to listen positively to what they have to show.

@HenkPoley

There is some discussion now on LKML: https://lkml.org/lkml/2016/4/23/135

@BLepers

BLepers commented Apr 25, 2016

Well, it seems that I cannot reply to the thread on LKML (maybe because I'm not subscribed to the list?). Anyway, here is our answer to the existing LKML thread, if anyone is interested.


Here are better instructions to reproduce the bugs. Out of the 4 bugs, 3 of them should happen on most NUMA boxes (with 2+ nodes, nothing fancy).

Before anything else: we believe that the bugs can create serious performance issues (especially on big machines), but, as mentioned by Peter, the patches we provide are not intended to be merged into the kernel. They are just hacks we used to benchmark a "better" scheduler behavior for our workloads, to get an idea of the performance gains.

The Missing Scheduling Domains bug

echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

After doing that, the load is no longer balanced between NUMA nodes. This is due to the fact that the NUMA-related sched_domains "disappear" when cpus are switched off and on.

You can easily see the bug using any multi-threaded app after disabling/reenabling a core:
$ any process with a large number of threads
$ htop # all threads should be running on a single NUMA node.
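
If you want a concrete process to use for this, a trivial spinner like the sketch below should do (this is just an illustrative helper, not one of our patches); compile with gcc -O2 -pthread spin.c -o spin and run, e.g., ./spin 64 after the offline/online sequence above:

/* spin.c - hypothetical reproducer: N threads that each burn 100% of a CPU */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *spin(void *arg)
{
    volatile unsigned long x = 0;

    (void)arg;
    for (;;)
        x++;                    /* burn CPU until the process is killed */
    return NULL;
}

int main(int argc, char **argv)
{
    int i, nthreads = argc > 1 ? atoi(argv[1]) : 64;
    pthread_t *tids = malloc(nthreads * sizeof(*tids));

    for (i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, spin, NULL);
    printf("spinning with %d threads; watch their placement in htop\n", nthreads);
    pause();                    /* keep the process alive */
    return 0;
}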

The bug was introduced in 3.19 and still exists in 4.6-rc5. It impacts all NUMA machines if you disable cores (granted, not a lot of people do that).

The Group Imbalance bug

We think that part of this bug comes from the way "load" is computed, when using autogroups. Prior to 4.3 (not sure how it has evolved since then), the load of a task is divided by the sum of the load of all tasks in the autogroup.

Let's take 2 applications, on a 64-cores machine:

  • autogroup 1: 1 thread that uses 100% of a cpu. The thread load is 1024.
  • autogroup 2: 63 threads that use 100% of a cpu each. Each thread has a load of 1024/63.

Suppose the thread of autogroup 1 runs on node 0 of the machine.
Then, when balancing the load between nodes, the load balancer looks at the average load of each node. The problem is that node 0 has an "artificially" high load because of the thread of autogroup 1, so the load is not properly balanced (node 0 ends up with a lot of idle cores).
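
To make the arithmetic concrete, here is a back-of-the-envelope sketch (illustrative only; the 8-node/64-core topology and the default weight of 1024 are assumptions matching our machine, and this is not code from our patches):

/* load_math.c - illustration of the autogroup load arithmetic described above */
#include <stdio.h>

int main(void)
{
    const double nice0_load = 1024.0;
    const int nodes = 8, cores_per_node = 8;

    /* autogroup 1: a single 100%-cpu thread pinned to core 0 (node 0),
     * carrying the whole group weight of 1024 */
    double node0_load = nice0_load;

    /* autogroup 2: 63 100%-cpu threads; the group weight is split among
     * them (1024/63 each); in the buggy state they crowd onto the other
     * 7 nodes, leaving node 0's remaining cores idle */
    double per_thread = nice0_load / 63.0;
    double other_node_load = (per_thread * 63.0) / (nodes - 1);

    printf("node 0 load:    %7.1f (1 of %d cores busy)\n", node0_load, cores_per_node);
    printf("nodes 1-7 load: %7.1f each (all cores busy)\n", other_node_load);
    /* node 0 reports by far the highest load, so the balancer never
     * migrates autogroup 2's threads onto its idle cores */
    return 0;
}

Node 0 ends up reporting a load of 1024 while every other node reports roughly 146, so the balancer sees node 0 as the busiest node and leaves its idle cores alone.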

To reproduce the bug, you can launch applications with different thread counts from multiple ssh connections:
$ ssh machine taskset -c 0 app 1 #1 thread on core 0, node 0
$ ssh machine app 63 # 63 threads

We expect 1 thread per core. In reality, core 0 is busy, but the other cores of node 0 are mostly idle.

I have seen a few messages on LKML complaining about autogroups, so I guess it impacts more machines than our own. This is also more serious than the previous bug, as autogroups are now active by default even on large machines.

The bug was introduced with autogroups (2.6.38) and is still there in 4.6-rc5.

The Scheduling Group Construction bug

This bug arises from the weird topology of our machine. (And I guess it is the only bug directly related to the topology.)

Basically a core:

  • ends up in all sched_groups of the last sched_domain of another core. This is bad because the core is then not considered for load balancing at that level.
  • is in no sched_group at lower levels of that core's sched_domain hierarchy, so the core is not considered for load balancing at lower levels either.

Let's say core 0 is in all sched_groups of the last sched_domain of core 8, and that core 0 is in no sched_group of any other sched_domain of core 8.
Then, when launching
taskset -c 0,8 app 100 # 100 threads
core 8 will not steal threads running on core 0. So all threads run on core 0.

The bug was introduced by cb83b629bae0327cf9f44f096adc38d150ceb913 and is still there in 4.6-rc5. You can have a look at the topology of the machine in the paper to better understand what's going on. Peter, I also sent you an email about this bug 2 years ago with extra details if you need more info.

Overload on wakeup

I am not sure if this issue will really be considered a "bug" here, but here is what happens: part of the problem is that, when waking up, a thread only chooses a core on the same NUMA node as the one on which it went to sleep.

On workloads that frequently sleep and wake up due to barriers, this behavior is suboptimal if other cores in the machine are idle. What happens precisely in our case is described on slide 25 of http://i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf

It is probably easy to reproduce the problem with microbenchmarks (the slides show what happens on "a commercial database" whose name starts with O ;) ), but I don't have a precise one in mind. Hopefully the problem is obvious enough to be understood without a precise command to reproduce it.
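
For what it's worth, a barrier-heavy microbenchmark along the following lines should exhibit the behavior (just a sketch under assumptions, not something we used for the paper); run it with more threads than a single node has cores and watch where they end up:

/* barrier_bench.c - hypothetical microbenchmark sketch: threads do unequal
 * bursts of work and then meet at a barrier, so they constantly sleep and
 * wake up.  With the wakeup behavior described above, woken threads stay on
 * the node they slept on even when cores on other nodes are idle. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 100000

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    unsigned int seed = (unsigned int)(unsigned long)arg;
    volatile unsigned long x = 0;
    int iter, spin;

    for (iter = 0; iter < ITERATIONS; iter++) {
        /* unequal work so that some threads arrive early and sleep */
        for (spin = 0; spin < (int)(rand_r(&seed) % 100000); spin++)
            x++;
        pthread_barrier_wait(&barrier);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int i, nthreads = argc > 1 ? atoi(argv[1]) : 16;
    pthread_t *tids = malloc(nthreads * sizeof(*tids));

    pthread_barrier_init(&barrier, NULL, nthreads);
    for (i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, (void *)(unsigned long)i);
    for (i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}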

We'll be following this thread if you need more information.

@igaw

igaw commented Apr 26, 2016

@BLepers LKML is an open list, no subscription needed. Just avoid HTML. It also uses greylisting, so maybe your SMTP server has problems handling that. Please send your reply to the list; it will be ignored if it stays here.

From vger.kernel.org:

All email sent to there must be TEXT/PLAIN, there can be no multipart messages, no VCARDs, nothing ``fancy''. In presence of such things, Majordomo will very likely do the wrong thing.

When you send there email, do make sure that all of the email headers, both visible and transport level, have same addresses in them. People experience problems when for example ``From:'', ``Sender:'' and possible ``Reply-To:'' headers present different addresses. The most common manifestation is complete silence from VGER!

You can test email delivery between you, and VGER by sending an empty test letter to: [email protected]

@ryoqun

ryoqun commented Dec 3, 2016

Hi, I just came from https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/

This is great work. Have these findings been merged into the mainline Linux kernel? Maybe with a final patch merged and validated by maintainers, this paper would really shine. :)

@brendangregg
Author

@ryoqun I tested their patches but their findings were not reproducible on the systems at Netflix, with some synthetic workloads. I believe that is because we are running mostly 1 and 2 node NUMA, as is usually the case with cloud guests.

I don't think these findings have been merged, nor discussed properly on lkml. It's only been discussed lightly[1]. I think the paper should have been discussed on lkml before publication, which is why I created this issue.

[1] https://lkml.org/lkml/2016/4/25/176

@AXDOOMER

AXDOOMER commented Dec 4, 2016

I think that I read somewhere that some were merged, but not all of them. Changes like this should be merged as fast as possible.

@brendangregg
Author

Patch 1

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5942,6 +5942,8 @@ struct sg_lb_stats {
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int idle_cpus;
 	unsigned int group_weight;
+   unsigned long min_load;
+   unsigned long max_load;
 	enum group_type group_type;
 	int group_no_capacity;

Here's the latest kernel:

	unsigned int sum_nr_running; /* Nr tasks running in the group */
	unsigned int idle_cpus;
	unsigned int group_weight;
	enum group_type group_type;
	int group_no_capacity;

Doesn't look like it was integrated, unless it was rewritten not to have those lines. What do you think? You can dig further to confirm.

Patch 2

@@ -6130,7 +6130,7 @@ static void claim_allocations(int cpu, s
 static int sched_domains_numa_levels;
 enum numa_topology_type sched_numa_topology_type;
 static int *sched_domains_numa_distance;
-int sched_max_numa_distance;
+int sched_max_numa_distance = -1;

The latest kernel:

int sched_max_numa_distance;

Not that bit.

Patch 3

--- linux-4.1.vanilla/kernel/sched/fair.c	2015-06-21 22:05:43.000000000 -0700
+++ linux-4.1.overload-on-wakeup/kernel/sched/fair.c	2015-11-05 01:30:19.693493606 -0800
@@ -4834,10 +4834,39 @@ select_task_rq_fair(struct task_struct *
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
+    int _cpu;
+    u64 oldest_idle_stamp = 0xfffffffffffffff;
+    int oldest_idle_stamp_cpu;
+
 	if (sd_flag & SD_BALANCE_WAKE)

The latest kernel:

	int new_cpu = prev_cpu;
	int want_affine = 0;
	int sync = wake_flags & WF_SYNC;

	if (sd_flag & SD_BALANCE_WAKE) {

Not that bit.

Patch 4

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5867,62 +5867,67 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
 	int i;
+   int tries;
 
 	cpumask_clear(covered);
 
-	for_each_cpu(i, span) {

The latest kernel:

	struct cpumask *covered = sched_domains_tmpmask;
	struct sd_data *sdd = sd->private;
	struct sched_domain *sibling;
	int i;

	cpumask_clear(covered);

	for_each_cpu(i, span) {

Not that bit.

@AXDOOMER can you please recall where you read that these were integrated and let us know? Unless these patches have been substantially rewritten beyond recognition, I'd conclude that they have not been integrated, and if anyone is saying otherwise it is misinformation and we should correct it.

@ryoqun

ryoqun commented Dec 4, 2016

@brendangregg @AXDOOMER thanks for the quick replies! I just hope this gets merged so the paper's effort isn't wasted and ultimately benefits all of us, maximizing the advantages of the open source nature of the Linux kernel. :)

@brendangregg
Author

brendangregg commented Dec 4, 2016

@ryoqun do you think you are hurt by these bugs? How many NUMA nodes does your system have? It's believed to only affect systems with high numbers of NUMA nodes. Try running numastat:

# numastat
                           node0
numa_hit              7302760990
numa_miss                      0
numa_foreign                   0
interleave_hit             29838
local_node            7302760990
other_node                     0

That's a typical system for us. Only one column (node0), so one node. All our systems are either one or two node NUMA, and these bugs aren't expected to affect us.

The paper was studying a system with 8 NUMA nodes.

@brendangregg
Author

BTW, this still needs to be posted to lkml.

Let me make this abundantly clear: nothing will be fixed until this is posted properly to lkml. Somebody must do this. The most obvious people to do so would be the authors of the paper.

I would volunteer to do this if these bugs hurt the systems at Netflix, but they do not. I might volunteer just "for the good of Linux", but I have other work commitments to do first.

@AXDOOMER

AXDOOMER commented Dec 4, 2016

@brendangregg In fact, what I think I read is that patches were written to fix some of the issues, so not every issue has a fix. I have read the whole research paper. I don't know if the kernel engineers are aware of "A Decade of Wasted Cores"; their help would be great.

@brendangregg
Author

You get kernel engineering help by posting to lkml and engaging them.

@Turbine1991

@brendangregg

While I can understand why such a patch may not affect Netflix noticeably, I can tell you under which circumstances this patch shines.

In regards to game hosting: if we had, say, 2 instances of a server, each consisting of two major threads, in the past they would be scheduled on the same core (presumably to save power/reduce heat). When the server's load started to spike, there'd be a massive hitch as the process was rescheduled. Because the process hovered around 70-80% CPU, other processes would be kept on the same core, and every time the load spiked to 100% it would hitch horribly.

After applying this patch and disabling the hyper-threaded cores (which don't work too well with the patch), we noticed the load was spread across the least used cores from the start.

In conclusion, while we got less overall performance (likely due to disabling HT), we got considerably more usable performance, since we could load up the machine with more processes without them noticeably affecting each other.

@brendangregg
Author

@Turbine1991 Did you try without HT and without the patches?

@Turbine1991

@brendangregg That was the second thing I tried; it only made things spikier with multiple processes. I'd just about given up before this patch landed.

@brendangregg
Author

@Turbine1991 Are you interested in raising the patches on lkml?

@AXDOOMER

AXDOOMER commented Dec 5, 2016

These patches are for Linux 4.1. Will they work with the latest kernel?

@Turbine1991

Turbine1991 commented Dec 5, 2016

@brendangregg I don't have the experience to do such a thing.

@AXDOOMER Yep. Feel free to use the script on my GitHub to easily compile the latest kernel on Ubuntu with this patch applied.

@ryoqun

ryoqun commented Dec 5, 2016

@brendangregg Well, I'm not directly affected by these bugs. I just found this paper online by chance, got interested, and wanted to know whether these patches have really been merged into the mainline kernel or not.

@AXDOOMER

AXDOOMER commented Dec 11, 2016

I have built the modified kernel (4.8.14) with the patches successfully using @Turbine1991's script from his repo. The patches included in this repo are not up-to-date: the group imbalance patch will fail (2 hunks out of 7) and the overload-on-wakeup patch will also fail.

So I'm running the patched kernel in my VM under PointLinux 3.2. I gave it 4 cores and I built the Linux Kernel 4.8.14 (vanilla) to benchmark.

Built on the patched kernel:

real	4m25.238s
user	13m52.932s
sys	1m25.820s

Built on the vanilla kernel:

real	26m56.151s
user	79m52.472s
sys	7m42.964s

So it's a lot faster. I'm impressed! I couldn't believe it at first, so I did it twice to confirm, and it was true. I got these results using "time make -j32", because you want more CPU-hungry threads than cores if you want to check whether the kernel balances them in the most optimized way.

@brendangregg
Author

brendangregg commented Dec 11, 2016

@AXDOOMER the place to discuss this result is lkml, thanks. Let them know you found a 6x win.

@AXDOOMER

AXDOOMER commented Dec 11, 2016

@brendangregg I don't have enough time to dedicate to arguing with people on the lkml. I'll have more time in two weeks, and I'd like to have more benchmarks. The benefits of these fixes only seem to appear when the system is under heavy load. I thought you had already started talking with them: https://lkml.org/lkml/2016/4/23/194

@AXDOOMER

I posted this on the lkml:
http://www.spinics.net/lists/kernel/msg2402099.html

@Turbine1991

Turbine1991 commented Dec 12, 2016

Could you please give me a few command lines to recreate the environment for building that package?

I figure running this on a bare-metal machine would give it more authenticity, as a VM could potentially skew the results. Running Linux in a VM on Windows, for example, wouldn't be a good test of real-world performance.

@AXDOOMER

I don't have a machine on which I can do this right now; that's why I would like other people to test. I don't think the VM helped the patched kernel get faster in any way. Have you tried the patched kernel on your own computer to see if it is faster? I don't see a performance change in normal usage, but when I fully load the CPU, it's much faster.

Note to myself: check if it's faster when putting a heavy load on a single core.

@AXDOOMER

AXDOOMER commented Dec 12, 2016

Nobody on the lkml probably cares about this because CFS is being replaced: http://www.phoronix.com/scan.php?page=news_item&px=MuQSS-v0.15-Linux-4.9

BFS is another scheduler and it already performs better than CFS. MuQSS is an improvement to BFS.

@Turbine1991

Turbine1991 commented Dec 12, 2016

BFS/MuQSS is unlikely to ever be mainlined, due to the reluctance of its primary developer.

In response to the benchmark: I ran unixbench and got a lower score with the patchset, due to over-scheduling. I haven't directly compared them with hyper-threading disabled on both. But really, the beauty of this patch is how it schedules things for per-application performance rather than overall performance, although I feel it'd do better when operating properly.

The only condition where I've felt improved performance was when I had a game server that was either:
- A single process with more than 1 heavy-usage thread, variable in usage.
- Two processes using high amounts of CPU at variable rates.

Both of these frequently had tasks scheduled on the same core; when usage spiked, the performance hit was felt.

@brendangregg
Author

@AXDOOMER thanks for posting to lkml. Unfortunately the merge window just opened (see Linus's email from earlier: https://lkml.org/lkml/2016/12/11/102), and maintainers are usually at their busiest during this time, working through the backlog of patchsets from the last couple of months. The merge window ends in 2 weeks and things go back to normal, but maybe they'll reply before then.

If one of the scheduler maintainers (Peter or Ingo) doesn't respond, I can follow up on lkml myself.

@Turbine1991

Just so we're clear, someone would practically have to create a new scheduler to merge these changes. In its current state, it's prototype code which doesn't care about battery life & thermal requirements, and it has minor bugs for certain workloads. Heck, we don't even know how it stacks up against bfq.

@rikvanriel

The problems pointed out by the Wasted Cores paper certainly are real.

The issue is that the patches introduce some problems of their own, and some of the problems identified in the paper may need to be fixed in a different way. The NUMA group construction seems like the most obvious fix to get merged.

The rest have some issues of their own. For example, always searching for an idle core anywhere in the system, without taking locality into account, creates regressions with some workloads. We still want to search for idle cores more aggressively, but we also need to take locality into account somehow.

@brendangregg
Author

@rikvanriel how widespread do you think they are? Most of our systems are 1 node NUMA, with some 2 only recently, as I expect will be the rest of EC2 if not all of the cloud. But I suspect you have exposure to different environments.

@rikvanriel

@brendangregg the point is not how common those larger systems are, but that we cannot just intentionally break them in order to make smaller systems work better for some workloads. We need to find solutions (to the problems identified in this paper) that work for all sizes of systems.

@brendangregg
Author

@rikvanriel fixing the bugs is one thing, but the paper has claimed that the Linux scheduler is fundamentally broken. I'm interested in understanding both: A) the fixes introduced, and B) whether the Linux scheduler really has been fundamentally broken, with a decade of wasted cores. If (B) were true, my company would have lost millions due to our choice of Linux on the cloud. It's an incredible claim, and not one that can be ignored.

@rikvanriel

@brendangregg these bugs show up with certain workloads, on certain types of machines. For many workloads the amount of CPU time wasted will be small, and for some workloads a little bit of wasted CPU time will be beneficial compared to the downside of running far away from the data (think two tasks communicating through shared memory, sockets, or pipes).

Every workload is different in the locality vs work-preserving trade-off. For some workloads, it is faster to run immediately. For other workloads, it would be faster to run closer to the data, even if it means waiting a short amount of time to run.

NUMA memory access penalties typically range from 20% to 100%, depending on the type and size of the system. Trading away locality for running immediately is not free. Wasting 5% of CPU time, but then running 10-20% faster due to better memory locality is a performance benefit, not a performance penalty.
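
To put illustrative numbers on that trade-off (the percentages below are assumed for the sake of the arithmetic, not measurements):

/* tradeoff.c - illustrative arithmetic only: compare "run immediately on a
 * remote node" against "wait briefly for a local core". */
#include <stdio.h>

int main(void)
{
    double work = 1.0;          /* local execution time, normalized */
    double numa_penalty = 0.15; /* assume running remotely is 15% slower */
    double wait = 0.05;         /* assume 5% of the work time spent idle waiting */

    double remote_now = work * (1.0 + numa_penalty);
    double wait_then_local = wait + work;

    printf("run remotely right away: %.2f\n", remote_now);      /* 1.15 */
    printf("wait, then run locally:  %.2f\n", wait_then_local); /* 1.05 */
    /* "wasting" 5% of CPU time still finishes about 9% sooner here */
    return 0;
}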

The bugs pointed out by this paper are real, but some of the proposed fixes are medicines with side effects worse than the disease they point out. That is why I say we may need to fix these issues in different ways.

@Turbine1991

For anyone interested in trying the new MuQSS scheduler (formerly BFS) on Ubuntu, you can now use the following script to choose between cfs/wastedcores/MuQSS - of course there must be a patch available for the kernel version. I'd recommend 4.8 (stable patch) or 4.9 (unstable patch) for testing.

https://github.com/Turbine1991/build_ubuntu_kernel_wastedcores

@AXDOOMER

AXDOOMER commented Dec 14, 2016

My latest benchmarks. @Turbine1991 I ran your script (time sudo ./build.sh) natively on an i7-4790 3.60GHz CPU without disabling HT. They may not be totally accurate, but there is no drastic improvement anymore.

CFS:
real    28m47.847s
user    110m14.624s
sys    7m15.684s

WASTED CORES:
real    29m35.988s
user    119m15.780s
sys    7m18.416s

MuQSS:
real    25m55.705s
user    133m17.376s
sys    4m11.440s

@rikvanriel

@AXDOOMER @brendangregg those numbers show nicely why just blindly applying the patches from the "Wasted Cores" paper is not the best idea. The problems identified by the Wasted Cores researchers are real, but the proposed fixes have their own regressions in different areas.

So far nobody has figured out how to fix all the identified issues without causing regressions elsewhere. This might be a good master's or PhD project for someone with a bunch of spare time who would like to get hired, right after obtaining their degree, by a company that depends on good Linux performance :)

@AXDOOMER

@rikvanriel Sure, blindly applying patches is not a good idea, but in some cases, such as when I applied them to my distro in my VM, it made a huge difference. That's why I'm not asking the kernel engineers to apply the patches; I just want them to be aware of the bugs that exist in the current scheduler.

People may still use the patches if they make a difference for them. @Turbine1991 has a nice script that makes it easy to try each of them.

@Turbine1991

I wonder how different these scores would be with HT disabled. Also, make sure your CPU isn't throttling; I suggest installing and using i7z. The 4790K runs really hot.
