
Concurrency issue leading to deadlock on produce #918

Open
erenboz opened this issue Feb 25, 2025 · 3 comments


erenboz commented Feb 25, 2025

This seems to be a continuation of #777; the fix there does not appear to be completely correct with respect to concurrency.

We have hit a bug where franz-go leaks a large number of goroutines: we call Produce without a cancelable context, from a goroutine spawned per call. Initially I thought it was bad concurrency around the wait condition, but that does not seem to be the case; I had just missed that the Cond reuses the same mutex.

We hit this on highly parallelized usage (GOMAXPROCS > 45 on a 36 physical core CPU), so it may require just the right kind of race to reproduce in a test. If the change is not acceptable as is with the current test, I might look into writing one later.


erenboz commented Feb 26, 2025

We'll try to set up something in prod that reproduces it so we can narrow down where it hangs.

MaxBufferedRecords is set to 5 and GOMAXPROCS is 80, running on an 80 CPU GKE n2 node. The stack looks like:

[Image: goroutine stack trace screenshot]
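
(For concreteness, the client setup amounts to roughly this sketch; only MaxBufferedRecords(5) is real here, the seed broker address is a placeholder and not our actual service code.)

package main

import "github.com/twmb/franz-go/pkg/kgo"

func main() {
	// Sketch of the described configuration: a client that blocks Produce
	// once 5 records are buffered.
	client, err := kgo.NewClient(
		kgo.SeedBrokers("broker-1:9092"), // placeholder seed broker
		kgo.MaxBufferedRecords(5),        // buffer limit mentioned above
	)
	if err != nil {
		panic(err)
	}
	defer client.Close()
}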

No Kafka errors on the callbacks; median latency shows as over 10s (a metric bucket limitation). The steady climb in goroutines indicates that some Produce calls are not returning.

[Image: goroutine count graph, steadily climbing]

Our effective code runs 6-7 times per request (300-400 requests/sec) across 5-6 different topics, with up to 960 partitions and max message bytes up to 4M:

go kclient.Produce(
	context.Background(),
	&kgo.Record{
		Value: message,
		Topic: topic,
	},
	callback,
)
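
For comparison, a sketch of the same call with a bounded context, reusing the variables above; if I read the kgo docs right, Produce stops waiting for buffer space once the context is done, so a stuck produce would fail its callback instead of pinning a goroutine forever. The 10 second timeout is an arbitrary placeholder.

go func() {
	// Bound how long a blocked Produce can hold this goroutine.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	kclient.Produce(ctx, &kgo.Record{
		Value: message,
		Topic: topic,
	}, callback)
}()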


twmb commented Mar 7, 2025

What's the callback? What's the stack trace once you're at that many goroutines? (you likely need to redact a lot in the stack trace).

Not to say the code in franz-go is flawless -- it definitely isn't (hence the dozens of patch releases...) -- but so far every report of a goroutine leak in the past few years has been a bug in the usage around the client.
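
If it helps, a goroutine profile dump captures exactly that stack trace; a minimal sketch using only the standard library (where you hook it into the service is up to you):

package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes every live goroutine's stack to stderr, in the same
// format as an unrecovered panic (debug level 2).
func dumpGoroutines() {
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
	dumpGoroutines()
}

The same output is available from net/http/pprof at /debug/pprof/goroutine?debug=2 if the service already exposes that handler.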

twmb added the waiting label Mar 7, 2025

erenboz commented Mar 8, 2025

Yeah, that's fair. I tried writing a test within the library codebase but was unable to reproduce anything. There is nothing obvious in the callback that would block or leak goroutines, but I'll try putting a similar test in the service for isolation when I'm back at work.
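
Rough shape of the test I have in mind for the service (untested sketch; the broker address, topic, and record count are placeholders), using go.uber.org/goleak to flag any Produce goroutines that never return:

package service_test

import (
	"context"
	"sync"
	"testing"

	"github.com/twmb/franz-go/pkg/kgo"
	"go.uber.org/goleak"
)

func TestProduceDoesNotLeakGoroutines(t *testing.T) {
	// Runs last (defers are LIFO), after client.Close() below, and fails
	// the test if any goroutines are still running at that point.
	defer goleak.VerifyNone(t)

	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder broker
		kgo.MaxBufferedRecords(5),
	)
	if err != nil {
		t.Fatal(err)
	}
	defer client.Close()

	// Mirror the production pattern: one goroutine per Produce, no cancel.
	var wg sync.WaitGroup
	for i := 0; i < 500; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client.Produce(context.Background(), &kgo.Record{
				Topic: "test-topic", // placeholder
				Value: []byte("message"),
			}, func(_ *kgo.Record, _ error) {})
		}()
	}
	wg.Wait()
}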
