
Fix produce mutex to allow variables to be updated in blocking condition #917

Open
wants to merge 3 commits into base: master

Conversation

erenboz

@erenboz erenboz commented Feb 25, 2025

We have hit a bug where franz-go leaks a whole lot of goroutines; we call Produce without a cancellable context, from a goroutine spawned for each call. Looking through the code, holding the mutex prevents other goroutines from updating the variables the wait condition reads, since quit, overMaxRecs, and overBytes all depend on p.mu. It also looks like the condition's Wait is missing the locking required by its semantics. For reference, this is sync.Cond.Wait from the standard library:

func (c *Cond) Wait() {
	c.checker.check()
	t := runtime_notifyListAdd(&c.notify)
	c.L.Unlock()
	runtime_notifyListWait(&c.notify, t)
	c.L.Lock()
}
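
(For reference, a minimal sketch of the general cond-wait pattern at play here, with hypothetical field names that do not mirror franz-go's actual internals: the point is that Wait releases the lock while blocked, so other goroutines can update the guarded counters under the same lock and then Broadcast.)

package example

import "sync"

// Illustrative producer-style buffer guard; field names are hypothetical.
type producer struct {
	mu            sync.Mutex
	c             *sync.Cond
	quit          bool
	bufferedRecs  int64
	bufferedBytes int64
	maxRecs       int64
	maxBytes      int64
}

func newProducer(maxRecs, maxBytes int64) *producer {
	p := &producer{maxRecs: maxRecs, maxBytes: maxBytes}
	p.c = sync.NewCond(&p.mu) // Wait will unlock/re-lock p.mu itself
	return p
}

// waitForSpace blocks while the buffer is over its limits. Wait releases
// p.mu while sleeping, so other goroutines can decrement the counters or
// set quit under the same lock and then Broadcast to wake this one.
func (p *producer) waitForSpace() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	for !p.quit && (p.bufferedRecs >= p.maxRecs || p.bufferedBytes >= p.maxBytes) {
		p.c.Wait()
	}
	return !p.quit
}

// finishRecord is what a draining goroutine would call after a record
// leaves the buffer.
func (p *producer) finishRecord(bytes int64) {
	p.mu.Lock()
	p.bufferedRecs--
	p.bufferedBytes -= bytes
	p.mu.Unlock()
	p.c.Broadcast()
}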

That said, this producer concurrency code reads fairly unnaturally; it might be better to rewrite it eventually. It's likely hard to write a test that reproduces the problem in the GitHub Actions environment, which may have very limited actual CPU parallelism; I tried setting a high GOMAXPROCS, but the test still passes both with and without the changes.

Issue #918

@twmb
Owner

twmb commented Feb 25, 2025

Cond.Wait unlocks the mutex internally; the existing code is the correct usage. You're hitting something else. The first iteration of your PR would panic (unlocking an unlocked mutex); the existing usage unlocks and re-locks the same mutex.
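
A minimal sketch of that point, with illustrative names rather than franz-go's internals: the caller holds the lock around the wait loop and lets Wait do the unlock/re-lock; adding an explicit Unlock before Wait crashes.

package example

import "sync"

var (
	mu    sync.Mutex
	cond  = sync.NewCond(&mu)
	ready bool // illustrative condition, set under mu elsewhere before cond.Broadcast()
)

// Correct usage, matching the existing producer code: hold mu and let
// Wait perform the unlock/re-lock itself while blocked.
func waitReady() {
	mu.Lock()
	for !ready {
		cond.Wait() // atomically releases mu, sleeps, re-acquires mu on wakeup
	}
	mu.Unlock()
}

// An extra mu.Unlock() before cond.Wait() (as in the PR's first iteration)
// would make Wait's own c.L.Unlock() hit an already-unlocked mutex and
// crash with "sync: unlock of unlocked mutex".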

What does your usage look like, and what is the stack dump that leads you to believe Produce is leaking?
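
One way to capture such a goroutine stack dump, as a minimal sketch using the standard runtime/pprof package (sending SIGQUIT to the process prints similar output):

package example

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the stacks of all live goroutines to stderr.
// A leak shows up as an ever-growing number of goroutines parked in the
// produce path (e.g. waiting on a sync.Cond or a context).
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}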

@erenboz
Author

erenboz commented Feb 25, 2025

Oh, I missed this:

	p.c = sync.NewCond(&p.mu)

That ensures that bit is correct, so the deadlock must come from somewhere else.

> What's your usage look like, and the stack dump of what leads you to believe Produce is leaking?

I was sifting through past deadlock-titled issues, and that bit of code read oddly. Decreasing concurrency in the service and increasing MaxBufferedRecords from the default 10k to 50k seemed to help, which suggests the problem is still around the produce blocking condition, even if not directly in it.

Our use case was producing into 5-6 big topics with varying partition counts, up to around 700-800. We also have quite large messages, so we set BatchMaxBytes to 4M. Our code was spawning an independent goroutine per call/topic and calling Produce with uncancellable contexts. We have since changed it to not spawn a goroutine just for calling Produce; that pattern came from a developer's misconception that Produce might be a synchronous blocking call. We no longer hit the issue in prod.
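
A minimal sketch of the corrected usage under these assumptions (the broker address and topic name are placeholders, and the 50k/4M limits mirror the settings mentioned above): kgo.Client.Produce is asynchronous and takes a promise callback, so no per-call goroutine is needed.

package main

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.MaxBufferedRecords(50_000),
		kgo.ProducerBatchMaxBytes(4_000_000),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	// Produce buffers the record (blocking only when the buffer limits
	// above are hit) and calls the promise once the record is acked or
	// fails, so there is no need to wrap it in a goroutine.
	cl.Produce(context.Background(), &kgo.Record{Topic: "big-topic", Value: []byte("payload")},
		func(_ *kgo.Record, err error) {
			if err != nil {
				log.Printf("produce failed: %v", err)
			}
		})

	// Flush blocks until all buffered records are produced.
	if err := cl.Flush(context.Background()); err != nil {
		log.Printf("flush failed: %v", err)
	}
}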

@erenboz
Author

erenboz commented Feb 26, 2025

We'll try to set up something that reproduces this in prod to narrow down where it happens.

Commented on the issue
