runtime: add runtime.BackgroundYield() #8
base: cockroach-go1.23.12
Conversation
I'm good with experimenting with this Go runtime enhancement. I definitely want to see experimental evidence of the benefit.

PS Should probably update the print in schedtrace to include sched.bgqsize.
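For concreteness, a sketch of what that schedtrace addition could look like; the surrounding print arguments are abbreviated here and the exact upstream format varies by Go version, so treat this as illustrative only:

```go
// In schedtrace (proc.go): report the background-yield queue length alongside
// the existing global run queue size. Abbreviated sketch, not the exact print.
print("SCHED ", (now-starttime)/1e6, "ms: gomaxprocs=", gomaxprocs,
	" idleprocs=", sched.npidle.Load(),
	" runqueue=", sched.runqsize,
	" bgqueue=", sched.bgqsize) // new field
```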
src/runtime/proc.go (outdated)
if gp.lockedm != 0 || gp.m.lockedg != 0 || gp.m.locks > 0 {
	return
}
if sched.runqsize > 0 || (gp.m.p != 0 && !runqempty(gp.m.p.ptr())) {
Don't you need to hold sched.lock when checking sched.runqsize and sched.bgqsize? Hmm, it looks like runqsize is sometimes checked without the lock and sometimes with, though it is always modified with the lock held. Seems like the runtime authors just assume this is understood. Worth dropping a comment here about the safety.
Yeah, I had a similar question when I was reading findRunnable and noticed reads of runqsize without holding the lock and without atomics, and indeed I specifically asked ChatGPT to look for all such cases and outline what's going on. My take-away was that it is done where the compiler isn't going to hoist the load entirely out of a loop or something, so it will actually be a load that's just at the mercy of cache-coherence delays, and that slightly stale read is good enough for the sort of "first pass" decisions it is used for, with critical decisions then falling back to a locked read like the one at the bottom of findRunnable if the unlocked reads didn't find anything. Yielding seems like just such a case, where it is more important that we be cheap than perfect, as we'll just yield a few checks later once our cache updates.
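To make that pattern concrete outside the runtime, here is a self-contained sketch of "cheap unlocked first pass, locked recheck before committing"; the names are illustrative, and note that in ordinary Go code the unlocked read is a data race the race detector would flag -- the runtime tolerates it internally because staleness is benign:

```go
package main

import "sync"

// Illustrative only: a possibly-stale unsynchronized check followed by a
// locked recheck before doing anything expensive, mirroring how the scheduler
// reads sched.runqsize without the lock on fast paths.
type fakeSched struct {
	mu       sync.Mutex
	runqsize int // always written with mu held, sometimes read without it
}

func (s *fakeSched) maybeYield(park func()) {
	// First pass: unsynchronized read. A stale value only means we yield (or
	// skip yielding) a call or two later; nothing is mutated here.
	if s.runqsize == 0 {
		return
	}
	// Slow path: recheck under the lock before the expensive action.
	s.mu.Lock()
	busy := s.runqsize > 0
	s.mu.Unlock()
	if busy {
		park()
	}
}

func main() {
	s := &fakeSched{runqsize: 1}
	s.maybeYield(func() { /* parking would happen here */ })
}
```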
src/runtime/proc.go (outdated)
}

gp := getg()
// Don't yield if locked to an OS thread or holding runtime locks.
I thought runtime locks were only held during internal runtime operations. That is, runtime locks can't be held if application code is calling BackgroundYield. Is that check purely defensive?
Yeah, just an abundance of caution, since it would not be good to park while holding one. But I don't really see how you could, unless the runtime itself were to use this, e.g. in the GC loop (which currently uses a similar yieldIfBusy utility).
	backgroundyield_slow(gp)
}

// Keep the heavy work (timeout check, park, locking) out of the inline path.
Nit: checkTimeouts is a no-op except for the JS runtime (i.e. it isn't "heavy work").
Ah, good to know. That said, since I only need to call it when I'm actually going to park, vs as part of the decision to park, I think it still belongs in the separate function, to keep the inlinable check function to the minimum required for the check.
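As a self-contained sketch of that split -- a tiny inlinable entry point that bails out fast, with the rare heavy work pushed into a separate non-inlined function (names and bodies here are placeholders, not the patch itself):

```go
package main

// The Go inliner works on a per-function cost budget, so keeping the entry
// point down to a couple of comparisons plus one call is what keeps it
// inlinable at call sites.

var busySignal bool // stand-in for the scheduler state the real check reads

// backgroundYield is the tiny, inlinable fast path: check and return quickly.
func backgroundYield() {
	if !busySignal {
		return
	}
	backgroundYieldSlow()
}

// backgroundYieldSlow carries the expensive part (locking, parking, etc.).
//
//go:noinline
func backgroundYieldSlow() {
	// ... park the goroutine, enqueue it on the background queue, etc.
}

func main() {
	for i := 0; i < 1000; i++ {
		backgroundYield() // hot loop: a load and a branch when not busy
	}
}
```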
src/runtime/runtime2.go (outdated)
	runqsize int32
	// Global background-yield queue: goroutines that voluntarily yielded
	// while the scheduler was busy. Does NOT contribute to runqsize.
	bgq gQueue
(just clarifying) So there is no per-P run queue for these (like p.runq) because:
- We expect the number of such background goroutines to be few?
- Even if they are not few, they are only run when the foreground goroutines are not runnable, which should be rarer (when P utilization is high, and if it is low, this goroutine wouldn't have had to yield), so grabbing them from the global queue is ok wrt concurrent performance?
src/runtime/proc.go (outdated)
//
// If there are any idle Ps this is a noop.
//
// If there are no idle Ps and the global run queue has runnable goroutines waiting,
What if the schedt.runq is empty but one of the p.runqs is non-empty? Will it keep running? I suppose we don't want to incur the synchronization of looking at each p.runq. Should it at least look at its P's runq?
src/runtime/proc.go (outdated)
// Fast path: tiny, inlineable checks only.

// Check if we need to yield at all and early exit fast if not.
if sched.npidle.Load() > 0 {
Given we already have a global atomic for npidle, should we consider adding one for npWithNonEmptyRunQ? That would eliminate the unfairness from my previous comment.
What do you mean here?
We are deciding to add this goroutine to bgq when there is no idle P and:
- The global runq is non-empty, or
- This P's runq is non-empty

Both the above could be false, but some other P could have a non-empty runq (which, if this P became idle, it would steal from). This is not desirable. We can fix this by having a schedt.npWithNonEmptyRunQ atomic: each P, when it transitions from runq empty to non-empty, would increment this atomic, and decrement on the reverse transition. The second bullet above would change to npWithNonEmptyRunQ > 0.
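A sketch of what maintaining such a counter could look like, written outside the runtime (npWithNonEmptyRunQ is the proposed name, not an existing field; the real change would hook the runq put/get paths in proc.go, where each p.runq is only mutated by its owning P, so the atomic add is the only shared write):

```go
package main

import "sync/atomic"

// Proposed idea: each P bumps the counter when its local run queue transitions
// from empty to non-empty and decrements on the reverse transition, so one
// atomic load answers "does any P have local runnable work?" without scanning
// every p.runq.
var npWithNonEmptyRunQ atomic.Int32

type fakeP struct {
	runq []int // stand-in for the real per-P run queue
}

func (pp *fakeP) runqput(g int) {
	if len(pp.runq) == 0 {
		npWithNonEmptyRunQ.Add(1) // empty -> non-empty
	}
	pp.runq = append(pp.runq, g)
}

func (pp *fakeP) runqget() (int, bool) {
	if len(pp.runq) == 0 {
		return 0, false
	}
	g := pp.runq[0]
	pp.runq = pp.runq[1:]
	if len(pp.runq) == 0 {
		npWithNonEmptyRunQ.Add(-1) // non-empty -> empty
	}
	return g, true
}

// shouldYield is the check under discussion: yield only when no P is idle and
// some run queue (global or any local) has work waiting.
func shouldYield(npidle, globalRunqSize int32) bool {
	return npidle == 0 && (globalRunqSize > 0 || npWithNonEmptyRunQ.Load() > 0)
}

func main() {
	var pp fakeP
	pp.runqput(1)
	_ = shouldYield(0, 0) // true: a local runq is non-empty
	pp.runqget()
}
```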
Yeah, I figured if the global runq is empty and this G's local queue has nothing, that's cheap to check. I was reluctant initially to add a new atomic or anything that needs to be maintained in any non-background paths, in case this is a patch we have to carry on our fork, so I was trying to stick to just what we already have: npidle, the global queue, and the local queue, but not other Ps' queues.

This has me thinking: if we were willing to leave a little utilization on the table, we could just say npidle < 1 is our signal to yield. Then we can infer that all the runqs, global and local, are empty -- or could be if they wanted to be, since we're leaving a whole P idle -- and that possibly has even better latency characteristics than waiting for a runq to become non-empty before jumping out of the way at the last minute. But again, at the cost of leaving a whole P on the table. Maybe that's ok (still better utilization than today for maxprocs>=4).
I guess another option is to make a num runnable atomic (if we have to keep it on a fork it isn’t that hard to grep for the casStatus(runnable) calls) and stop looking at either npidle or runqsize?
src/runtime/proc.go (outdated)
if gp.lockedm != 0 || gp.m.lockedg != 0 || gp.m.locks > 0 {
	return
}
if sched.runqsize > 0 || (gp.m.p != 0 && !runqempty(gp.m.p.ptr())) {
Looks like we are checking the local P's runq, so ignore part of my earlier comment.
src/runtime/proc.go (outdated)
// Yielded goroutines were runnable but voluntarily deprioritized themselves
// to waiting instead by calling BackgroundYield. If we have nothing runnable
// we can bring a yielded goroutine back to runnable and run it.
if sched.bgqsize != 0 {
Does the Go compiler know something special about certain fields and make their unsynchronized reads "less stale"?
Not that I can find. The runtime, and in particular the scheduler, uses these a-bit-stale reads liberally (though in important things like findRunnable there are locked reads backstopping them after the fast paths are exhausted). I was, uh, curious about this too. As far as I can tell -- and what ChatGPT tells me as well -- they're just good old unlocked, un-synced reads. They're uint32, so there's no worry about a torn write (even on 32-bit); staleness is the only concern. They aren't in loops out of which the load might be lifted by the compiler, so it'd be an actual load, just up to MESI or whatever, I guess. Though I did my initial testing with a cruder version of this patch, before I cleaned it up for a PR, and one of my cleanups was to split it up to ensure the cheap checks could be inlined; I wonder if that is a mistake, since if it is inlined into a for loop -- the place we expect this to be called -- maybe the load does get lifted and then never sees new values? I guess I should re-test with the optimized version.
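To illustrate the hoisting concern in isolation (not code from the patch): if an inlined check reads a plain, non-atomic variable inside a loop that never writes it or synchronizes, the compiler is allowed to load it once and reuse the value, whereas an atomic load has to be re-performed each iteration:

```go
package main

import "sync/atomic"

var busyPlain int32         // plain, unsynchronized flag
var busyAtomic atomic.Int32 // atomic flag

// spinPlain may legally be compiled as "load busyPlain once, then spin
// forever" because the loop body contains no synchronization at all.
func spinPlain() {
	for busyPlain == 0 {
	}
}

// spinAtomic must reload the flag every iteration; an update made by another
// goroutine is eventually observed.
func spinAtomic() {
	for busyAtomic.Load() == 0 {
	}
}

func main() {
	busyPlain = 1
	busyAtomic.Store(1)
	spinPlain()  // returns immediately since the flag is already set
	spinAtomic() // likewise
}
```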
src/runtime/proc.go (outdated)
// scheduled for at least the past duration. This allows the calling goroutine
// to offer some degree of fairness among goroutines that opt in to yielding,
// as otherwise yielding is only done based on the (non-background) run queue.
//
Given a goroutine is only a background goroutine when it successfully yields, I wonder whether one can arrange things such that it never yields. Say it keeps transitioning out of the running state after every 1ms of CPU consumption for some IO, and passes 2ms to this parameter. It will not yield to the goroutines in bgq, yes? If yes, it may justify keeping this fairness behavior out of the scheduler, since it can be accomplished by waiting for re-admission via AC.
Correct, it would never yield to background work if it kept coming from some unscheduled state. I was thinking that if it was unscheduled, that probably means whatever background goroutine was at the head of the queue got a chance to run anyway -- thanks to it unscheduling for whatever it blocked on, even if not thanks to it deliberately yielding -- and then when our unfair caller comes back there is a non-bg runq, so it yields to that. So the fairness mechanism is there just in case nothing else blocks it.
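For context, a sketch of the kind of caller this parameter is aimed at -- a long-running background loop that offers to yield between items. The signature used below (a single duration argument) is assumed from the doc comment quoted above, and backgroundYield here is just a stub standing in for the proposed runtime.BackgroundYield:

```go
package main

import "time"

// backgroundYield stands in for the proposed runtime API; the duration
// argument is assumed, not a confirmed signature. The real thing would park
// only if the scheduler is busy, or if this goroutine has been running for at
// least the given duration without yielding to background work.
func backgroundYield(d time.Duration) {}

type workItem struct{}

func process(workItem) {}

// compactLoop shows the intended call pattern: do a chunk of work, then offer
// to get out of the way. In the reviewer's scenario above, a loop that blocks
// on IO every ~1ms while passing 2ms here would never hit the fairness path.
func compactLoop(items []workItem) {
	for _, it := range items {
		process(it)
		backgroundYield(2 * time.Millisecond)
	}
}

func main() {
	compactLoop(make([]workItem, 3))
}
```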
src/runtime/backgroundyield_test.go (outdated)
defer runtime.GOMAXPROCS(orig)

runtime.GOMAXPROCS(target)
runtime.Gosched()
what's this for?
src/runtime/export_test.go (outdated)
	}
}

// RunBackgroundYieldQueueCheck exercises the background queue enqueue/dequeue
I'm confused by this method. What does it exercise? What does success mean? Why does it return skipped==true if there's something in the bgq? Why is this only used in a test that doesn't seem to do anything?
Can't use testing.T or testing at all in the runtime package, only in the external runtime_test test package. So the common pattern, if you want to test anything non-exported, seems to be a RunX func (in this _test.go file, which is not in the _test package) that exercises the internal code / returns an exported version, then a usually very thin TestX in the external runtime_test package that calls it.

That's the reason for the overall setup, but I'll go back and review the actual logic in this one: I added these last couple of tests while just chasing coverage % on the train, so there might be some room for cleanup/commenting here.
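A minimal sketch of that RunX/TestX pattern, with made-up names (RunSomethingInternal / TestSomethingInternal are illustrative, not the functions in this PR):

```go
// --- src/runtime/export_test.go (package runtime: test-only but internal, so
// it can reach unexported scheduler state, yet cannot import "testing") ---

package runtime

func RunSomethingInternal() bool {
	// exercise unexported code here and return an exported result
	return true
}

// --- src/runtime/something_test.go (package runtime_test: the external test
// package, which can use testing.T and calls the exported Run helper) ---

package runtime_test

import (
	"runtime"
	"testing"
)

func TestSomethingInternal(t *testing.T) {
	if !runtime.RunSomethingInternal() {
		t.Fatal("internal check failed")
	}
}
```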