Conversation

@dt dt commented Oct 6, 2025

No description provided.

@dt dt requested review from sumeerbhola and tbg October 6, 2025 22:17

@petermattis petermattis left a comment


I'm good with experimenting with this Go runtime enhancement. I definitely want to see experimental evidence of the benefit.

PS: We should probably update the print in schedtrace to include sched.bgqsize.

if gp.lockedm != 0 || gp.m.lockedg != 0 || gp.m.locks > 0 {
return
}
if sched.runqsize > 0 || (gp.m.p != 0 && !runqempty(gp.m.p.ptr())) {


Don't you need to hold sched.lock when checking sched.runqsize and sched.bgqsize? Hmm, it looks like runqsize is sometimes checked without the lock and sometimes with, though it is always modified with the lock held. It seems like the runtime authors just assume this is understood. Worth dropping a comment here about the safety.

Member Author


Yeah, I had a similar question when I was reading findRunnable and noticed reads of runqsize without holding the lock and without atomics, and indeed specifically asked ChatGPT to look for all such cases and outline what's going on. My take-away was that this is done where the compiler isn't going to hoist the load entirely out of a loop or something, so it will actually be a load that's just at the mercy of cache-coherence delays, and that slightly stale read is good enough for the sort of "first pass" decisions it is used for, with critical decisions then falling back to a locked read, like the one at the bottom of findRunnable, if the unlocked reads didn't find anything. Yielding seems like just such a case, where it is more important we be cheap than perfect, as we'll just yield a few checks later once our cache updates.

}

gp := getg()
// Don't yield if locked to an OS thread or holding runtime locks.


I thought runtime locks were only held during internal runtime operations. That is, runtime locks can't be held if application code is calling BackgroundYield. Is that check purely defensive?

Member Author


Yeah, just an abundance of caution, since it would not be good to park while holding one. But I don't really see how you could, unless the runtime itself were to use this, e.g. in the GC loop (which currently uses a similar yieldIfBusy utility).

backgroundyield_slow(gp)
}

// Keep the heavy work (timeout check, park, locking) out of the inline path.


Nit: checkTimeouts is a no-op except for the JS runtime (i.e. it isn't "heavy work").

Member Author


Ah, good to know. That said, since I only need to call it when I'm actually going to park, vs as part of the decision to park, I think it still belongs in the separate function, to keep the inlinable check function to the minimum required for the check.

@rickystewart rickystewart force-pushed the cockroach-go1.23.12 branch 3 times, most recently from 84fef0d to ec86954 on October 9, 2025 21:21
runqsize int32
// Global background-yield queue: goroutines that voluntarily yielded
// while the scheduler was busy. Does NOT contribute to runqsize.
bgq gQueue


(just clarifying) So there is no per-P run queue for these (like p.runq) because:

  • We expect the number of such background goroutines to be few?
  • Even if they are not few, they are only run when the foreground goroutines are not runnable, which should be rarer (when P utilization is high, and if it is low, this goroutine wouldn't have had to yield), so grabbing them from the global queue is ok wrt concurrent performance?

//
// If there are any idle Ps this is a noop.
//
// If there are no idle Ps and the global run queue has runnable goroutines waiting,


What if the schedt.runq is empty but one of the p.runqs is non-empty? Will it keep running? I suppose we don't want to incur the synchronization of looking at each p.runq. Should it at least look at its P's runq?

// Fast path: tiny, inlineable checks only.

// Check if we need to yield at all and early exit fast if not.
if sched.npidle.Load() > 0 {


Given we already have a global atomic for npidle, should we consider adding one for npWithNonEmptyRunQ? That would eliminate the unfairness from my previous comment.

Member


What do you mean here?


We are deciding to add this goroutine to bgq when there is no idle P and:

  • The global runq is non-empty, or
  • This P's runq is non-empty

Both the above could be false, but some other P could have a non-empty runq (which, if this P became idle, it would steal from). This is not desirable. We can fix this by having a schedt.npWithNonEmptyRunQ atomic: each P when it transitions from runq empty to non-empty would increment this atomic, and decrement on the reverse transition. The second bullet above would change to npWithNonEmptyRunQ > 0.

Member Author


Yeah, I figured that if the global runq is empty and this G has nothing in its local runq, that's cheap to check. I was initially reluctant to add a new atomic, or anything that needs to be maintained in non-background paths, in case this is a patch we have to carry on our fork, thus trying to stick to just what we already have: npidle, the global queue, and the local queue, but not other Ps' queues.

This has me thinking: if we were willing to leave a little utilization on the table, we could just use npidle < 1 as our signal. Then we can infer that all the runqs, global and local, are empty, or could be if they wanted to be, since we're leaving a whole P idle, and that possibly has even better latency characteristics than waiting for a runq to become non-empty and jumping out of the way at the last minute. But again, at the cost of leaving a whole P on the table. But maybe that's OK (still better utilization than today for maxprocs >= 4).

Member Author


I guess another option is to make a num-runnable atomic (if we have to keep it on a fork, it isn't that hard to grep for the casgstatus(runnable) calls) and stop looking at either npidle or runqsize?

if gp.lockedm != 0 || gp.m.lockedg != 0 || gp.m.locks > 0 {
return
}
if sched.runqsize > 0 || (gp.m.p != 0 && !runqempty(gp.m.p.ptr())) {


Looks like we are checking the local P's runq, so ignore part of my earlier comment.

// Yielded goroutines were runnable but voluntarily deprioritized themselves
// to waiting by calling BackgroundYield. If we have nothing else runnable
// we can bring a yielded goroutine back to runnable and run it.
if sched.bgqsize != 0 {


Does the go compiler know something special about certain fields and makes their unsynchronized reads "less stale"?

Member Author


Not that I can find. The runtime, and the scheduler in particular, uses these a-bit-stale reads liberally (though in important things like findRunnable there are locked reads backstopping them after the fast paths are exhausted). I was, uh, curious about this too. As far as I can tell -- and what ChatGPT tells me as well -- they're just good old unlocked, unsynchronized reads. They're int32, so there's no worry about tearing a write (on 32-bit), so staleness is the only concern. They aren't in loops out of which the load might be hoisted by the compiler, so each is an actual load, just up to MESI or whatever I guess. Though I did my initial testing with a cruder version of this patch, before I cleaned it up for a PR, and one of my cleanups was to split it up to ensure the cheap checks could be inlined; I wonder if that was a mistake, since if the check is inlined into a for loop -- the place we expect this to be called -- maybe the load does get hoisted and then never sees new values? I guess I should re-test with the optimized version.

// scheduled for at least the past duration. This allows the calling goroutine
// to offer some degree of fairness among goroutines that opt in to yielding,
// as otherwise yielding is only done based on the (non-background) run queue.
//


Given a goroutine is only treated as a background goroutine when it successfully yields, I wonder whether one can arrange things such that it never yields. Say it keeps transitioning out of the running state after every 1ms of CPU consumption for some IO, and passes 2ms as this parameter. It will not yield to the goroutines in bgq, yes? If yes, that may justify keeping this fairness behavior out of the scheduler, since it can be accomplished by waiting for re-admission via AC.

Member Author

@dt dt Oct 10, 2025


Correct, it would never yield to background work if it kept coming back from some unscheduled state. I was thinking that if it was unscheduled, that probably means whatever background work was at the head of the queue got a chance to run anyway, thanks to the caller unscheduling for whatever it blocked on, even if not thanks to it deliberately yielding; then when our unfair caller comes back, there is a non-bg runq entry, so it yields to that. So the fairness mechanism is there just in case nothing else blocks it.

defer runtime.GOMAXPROCS(orig)

runtime.GOMAXPROCS(target)
runtime.Gosched()
Member


what's this for?

}
}

// RunBackgroundYieldQueueCheck exercises the background queue enqueue/dequeue
Member


I'm confused by this method. What does it exercise? What does success mean? Why does it return skipped==true if there's something in the bgq? Why is this only used in a test that doesn't seem to do anything?

Member Author


Can't use testing.T, or testing at all, in the runtime package -- only in the runtime_test external test package. So the common pattern, if you want to test anything non-exported, seems to be a RunX func (in a _test.go file that is in package runtime, not the _test package) that exercises the internal code or returns an exported version, then a usually very thin TestX in the external runtime_test package that calls it.

That's the reason for the overall setup, but I'll go back and review the actual logic in this one: I added these last couple of tests while just chasing coverage % on the train, so there might be some room for cleanup/commenting here.
