
Fix produce mutex to allow variables to be updated in blocking condition #917

Open
wants to merge 3 commits into base: master

Conversation

erenboz

@erenboz erenboz commented Feb 25, 2025

We have hit a bug where franz-go leaks a whole lot of goroutines; we call Produce without a cancellable context, from a goroutine spawned for each call. Looking through the code, holding the mutex prevents other goroutines from updating the variables the wait condition reads, since quit, overMaxRecs, and overBytes all depend on p.mu. It also looks like the condition's Wait is missing the locking required by its semantics. For reference, this is sync.Cond.Wait from the standard library:

func (c *Cond) Wait() {
	c.checker.check()
	t := runtime_notifyListAdd(&c.notify)
	c.L.Unlock()
	runtime_notifyListWait(&c.notify, t)
	c.L.Lock()
}
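
(For reference, a minimal sketch of the general cond-wait pattern at play here, with hypothetical field names that do not mirror franz-go's actual internals: the point is that Wait releases the lock while blocked, so other goroutines can update the guarded counters under the same lock and then Broadcast.)

package example

import "sync"

// Illustrative producer-style buffer guard; field names are hypothetical.
type producer struct {
	mu            sync.Mutex
	c             *sync.Cond
	quit          bool
	bufferedRecs  int64
	bufferedBytes int64
	maxRecs       int64
	maxBytes      int64
}

func newProducer(maxRecs, maxBytes int64) *producer {
	p := &producer{maxRecs: maxRecs, maxBytes: maxBytes}
	p.c = sync.NewCond(&p.mu) // Wait will unlock/re-lock p.mu itself
	return p
}

// waitForSpace blocks while the buffer is over its limits. Wait releases
// p.mu while sleeping, so other goroutines can decrement the counters or
// set quit under the same lock and then Broadcast to wake this one.
func (p *producer) waitForSpace() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	for !p.quit && (p.bufferedRecs >= p.maxRecs || p.bufferedBytes >= p.maxBytes) {
		p.c.Wait()
	}
	return !p.quit
}

// finishRecord is what a draining goroutine would call after a record
// leaves the buffer.
func (p *producer) finishRecord(bytes int64) {
	p.mu.Lock()
	p.bufferedRecs--
	p.bufferedBytes -= bytes
	p.mu.Unlock()
	p.c.Broadcast()
}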

That said, this producer concurrency code reads fairly unnaturally; it might be better to rewrite it eventually. It's likely hard to write a test that reproduces the problem in the GitHub Actions environment, which may have very limited actual CPU parallelism; I tried setting a high GOMAXPROCS, but the test still passes both with and without the changes.

Issue #918

@twmb
Owner

twmb commented Feb 25, 2025

Cond.Wait unlocks the mutex internally; the existing code is the correct usage. You're hitting something else. The first iteration of your PR would panic (unlocking an unlocked mutex); the existing usage unlocks and re-locks the same mutex.
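
A minimal sketch of that point, with illustrative names rather than franz-go's internals: the caller holds the lock around the wait loop and lets Wait do the unlock/re-lock; adding an explicit Unlock before Wait crashes.

package example

import "sync"

var (
	mu    sync.Mutex
	cond  = sync.NewCond(&mu)
	ready bool // illustrative condition, set under mu elsewhere before cond.Broadcast()
)

// Correct usage, matching the existing producer code: hold mu and let
// Wait perform the unlock/re-lock itself while blocked.
func waitReady() {
	mu.Lock()
	for !ready {
		cond.Wait() // atomically releases mu, sleeps, re-acquires mu on wakeup
	}
	mu.Unlock()
}

// An extra mu.Unlock() before cond.Wait() (as in the PR's first iteration)
// would make Wait's own c.L.Unlock() hit an already-unlocked mutex and
// crash with "sync: unlock of unlocked mutex".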

What does your usage look like, and what is the stack dump that leads you to believe Produce is leaking?
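
One way to capture such a goroutine stack dump, as a minimal sketch using the standard runtime/pprof package (sending SIGQUIT to the process prints similar output):

package example

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the stacks of all live goroutines to stderr.
// A leak shows up as an ever-growing number of goroutines parked in the
// produce path (e.g. waiting on a sync.Cond or a context).
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}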

@erenboz
Author

erenboz commented Feb 25, 2025

Oh, I missed this:

	p.c = sync.NewCond(&p.mu)

That ensures that bit is correct, so the deadlock must come from somewhere else.

> What's your usage look like, and the stack dump of what leads you to believe Produce is leaking?

I was sifting through past deadlock-titled issues, and that bit of code read oddly. Decreasing concurrency in the service and increasing MaxBufferedRecords from the default 10k to 50k seemed to help, which suggests the problem is still around the produce blocking condition, even if not directly in it.

Our use case was producing into 5-6 big topics with varying partition counts, up to around 700-800. We also have quite large messages, so we set BatchMaxBytes to 4M. Our code was spawning an independent goroutine per call/topic and calling Produce with uncancellable contexts. We have since changed it to not spawn a goroutine just for calling Produce; that pattern came from a developer's misconception that Produce might be a synchronous blocking call. We no longer hit the issue in prod.
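
A minimal sketch of the corrected usage under these assumptions (the broker address and topic name are placeholders, and the 50k/4M limits mirror the settings mentioned above): kgo.Client.Produce is asynchronous and takes a promise callback, so no per-call goroutine is needed.

package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.MaxBufferedRecords(50_000),
		kgo.ProducerBatchMaxBytes(4_000_000),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	// Produce buffers the record (blocking only when the buffer limits
	// above are hit) and calls the promise once the record is acked or
	// fails, so there is no need to wrap it in a goroutine.
	cl.Produce(context.Background(), &kgo.Record{Topic: "big-topic", Value: []byte("payload")},
		func(_ *kgo.Record, err error) {
			if err != nil {
				log.Printf("produce failed: %v", err)
			}
		})

	// Flush blocks until all buffered records are produced.
	if err := cl.Flush(context.Background()); err != nil {
		log.Printf("flush failed: %v", err)
	}
}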

@erenboz
Author

erenboz commented Feb 26, 2025

We'll try to set up something that reproduces this in prod to narrow down where it happens.

Commented on the issue
