
Possible memory leak in Go agent #906

Open
betabandido opened this issue Apr 29, 2024 · 15 comments

@betabandido

Description

We have several Go applications that use the New Relic Go agent, and they work well. But one particular application seems to be experiencing what looks like a memory leak. The agent setup is very similar across these applications, which might suggest the problem is elsewhere, but we have run Go's pprof tool and everything points to an issue with the NR Go agent. We are trying to create a reproducible example; in the meantime we wanted to open this issue and add as much information as possible, in the hope of getting the wheels turning towards a solution.

Steps to Reproduce

We cannot reproduce the issue outside of this particular application (yet).

Expected Behavior

Memory should not be increasing over time.

NR Diag results

The following diagram shows the memory usage for the application running on our kubernetes cluster.

[image: memory usage chart for the application]

As the image shows, memory keeps increasing until either we deploy a new release or the kubernetes cluster kills the app for consuming too much memory.

Your Environment

A Go application compiled using Go 1.21, running on a kubernetes cluster (EKS 1.27).

New Relic Go agent version 3.32.0.

Reproduction case

We are working on trying to reproduce the issue.

Additional context

Here you can see two outputs from pprof. The first one is from last Friday (Apr 26th), and the second one is from today (Apr 29th). As you can see, the heap memory in use within the StoreLog method has doubled (from ~40 MB to ~80 MB).

  1. Apr 26th: [heap profile image: profile001]

  2. Apr 29th: [heap profile image: profile002]
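
For reference, heap profiles like these can be captured by exposing Go's standard net/http/pprof endpoints. The sketch below is a generic setup, not necessarily how our service is instrumented (the localhost:6060 port is just the conventional choice):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func init() {
    // Serve the profiling endpoints on a side port. A heap profile can then
    // be captured with: go tool pprof http://localhost:6060/debug/pprof/heap
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}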

@iamemilio
Contributor

iamemilio commented Apr 29, 2024

Hi @betabandido. Before we flag this as a memory leak, we need to point out that, by design, the Go agent stores log events on their transaction until that transaction ends. Transactions that run for a long time will accumulate all sorts of memory, and logs happen to be very large. We do it this way because of sampling in the agent: the weight of a sampled transaction is not calculated until it is finished, so rather than risk dropping critical data, we hold onto it. All logs associated with a transaction are kept on board until that transaction ends. This is currently a design requirement of all agents.
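
For illustration, a minimal sketch of that lifecycle (the handler name and application wiring are invented; the buffering behavior is agent-internal, not something this code controls):

package main

import "github.com/newrelic/go-agent/v3/newrelic"

// Any log events captured by the agent's logs-in-context integrations between
// StartTransaction and End are buffered on this transaction; they only move to
// the harvester (and become eligible for garbage collection) once End() runs.
func handleRequest(app *newrelic.Application) {
    txn := app.StartTransaction("handleRequest") // hypothetical transaction name
    defer txn.End() // a transaction that never ends keeps its buffered log data in memory

    // ... application work that writes logs within the transaction context ...
}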

@betabandido
Author

betabandido commented Apr 29, 2024

Thanks @iamemilio! That makes sense. Yet it seems we have neither of the following:

  • long-lasting transactions
  • a continuously increasing number of transactions

Therefore, I would expect StoreLog to consume a certain amount of memory, but not to increase its memory consumption over time.

I collected some data on how long this particular app's transactions last, using:

SELECT average(duration * 1000), percentile(duration * 1000, 99), max(duration * 1000)
FROM Transaction 
WHERE appName = '<our app name>'
SINCE 7 days ago

This is what I got:

11.7 ms (avg)
78 ms (99th percentile)
3360 ms (max)

No really long-lasting transactions here.

For the number of transactions, I used:

SELECT count(*)
FROM Transaction 
WHERE appName = '<our app name>'
SINCE 7 days ago
TIMESERIES 1 hour

and I got the following chart:

[image: transaction count chart, hourly time series]

There is a cyclic pattern, but I cannot see an ever-growing number of transactions between April 26th and 29th.

I acknowledge this might be a difficult issue to debug without a reproducible example. But, while we try to come up with one, please let us know how else we can help you gather more data to pinpoint the root cause.

@betabandido
Author

@iamemilio This morning I just got some more data with pprof. Memory consumption in StoreLog went from ~80 MB to ~100 MB (see diagram below). Eventually the pods will get killed, and the process will start over again.

[heap profile image: profile003]

@mirackara
Contributor

Hi @betabandido. Thanks for the profiling data and for bringing this to our attention (especially the last couple of updates!). We will continue investigating this, but it may take a while to identify. If a reproducible example is possible, that would be greatly appreciated and will help us pinpoint the issue faster. We'll keep monitoring this thread in the meantime.

Thanks!

@betabandido
Author

Hi @iamemilio @mirackara. We have come up with a reproducible example that we can share. What would be the preferred way to do so?

@mirackara
Contributor

Hi @betabandido. A public repository would be great if possible.

@iamemilio
Contributor

iamemilio commented May 13, 2024

If you would prefer not to share code publicly, please get in touch with New Relic support and request assistance with the Go agent. A support engineer will handle your case, and we will be able to communicate privately from there. That being said, the issue is already public, so as long as the reproducer is not sensitive to you, we don't mind it being posted somewhere public, or even just inline in this thread.

@betabandido
Author

@mirackara @iamemilio I have created this repo: https://github.com/betabandido/nrml

I've also added some instructions that hopefully will help. Please let me know if you have any doubts or questions.

@mirackara
Contributor

Hi @betabandido

First off, huge kudos to you + team. This reproducer is very detailed and will be a massive help for us. We don't have an ETA on this yet, but we are tracking this internally and will give the investigation the time it deserves. If we have any updates/questions we'll post them in this thread.

Thank You!

@nr-swilloughby
Contributor

FYI, we're actively working on this issue and will update this when we have a solution.

@betabandido
Author

@nr-swilloughby Thanks for the update! Does this mean you have identified the issue and are in the process of solving it?

@iamemilio
Contributor

iamemilio commented Oct 7, 2024

A fix for the race condition should be in the upcoming release. The changes are also staged in develop right now if you want to try them early.

@betabandido
Author

@iamemilio That's great to know! Out of curiosity, could you please share a link to the commit that fixed the issue?

@nr-swilloughby
Contributor

Apologies, @betabandido, the commit Emilio referred to was for a different issue (a race condition) that we're releasing a fix for today. Before I commit changes to the core harvester code for your issue, I want to confirm a few things to make absolutely sure we've eliminated other causes for the memory issue you reported and accounted for any odd differences between your environment and mine. I've also looked at the code outside AWS, just to reduce the number of variables in play as much as possible.

Your issue is related to log events on transactions being retained when they shouldn't be. The problem is that, under every circumstance we've observed, that memory should be released after each successful harvest cycle that includes the log event data. You are adding your database query parameters to your transaction logs, which may account for the data volume, but that still wouldn't explain why the data aren't eventually getting garbage collected.

So one thing I want to confirm is that those harvests are, in fact, happening successfully when you run this code in your AWS environment. If they aren't, the agent will re-consume the harvest data so it can hold it for the next harvest cycle, and if harvests keep failing, it could retain that data indefinitely.

Could you add a call to ConfigDebugLogger to newrelic.NewApplication in your application so that it captures debug-level logs of its operation, and send me a sample of that output (redacting any sensitive information as needed)? What I'm looking for is the interaction between the agent and New Relic's collector endpoint: either the agent is successfully delivering log event data during the harvest cycles, or the collector is throwing back an error which might make the agent hold on to the data longer.
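
For reference, a minimal sketch of that change (the app name and license lookup are placeholders; ConfigDebugLogger and ConfigAppLogForwardingEnabled are existing config options):

package main

import (
    "os"

    "github.com/newrelic/go-agent/v3/newrelic"
)

func newApp() (*newrelic.Application, error) {
    return newrelic.NewApplication(
        newrelic.ConfigAppName("our-app-name"),                     // placeholder app name
        newrelic.ConfigLicense(os.Getenv("NEW_RELIC_LICENSE_KEY")), // placeholder license source
        newrelic.ConfigAppLogForwardingEnabled(true),               // assuming log forwarding is already enabled
        newrelic.ConfigDebugLogger(os.Stdout),                      // debug-level agent logs, including harvest activity
    )
}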

Could you also confirm whether this has always been an issue with this one application, or whether it suddenly started happening after a particular release of the Go agent, the Go toolchain, or other dependencies?

I think we're close to giving you a final answer on this, but want to be sure the above is clear before committing anything.
Thanks!

@nr-swilloughby
Contributor

There is one other possible source of the memory growth you're seeing that wouldn't show up in any of our testing on our side: if you declare a Transaction value that stays in scope in your application (regardless of whether End() was called to end the transaction), it will hold on to all the data collected for that transaction. Only after the Transaction value goes out of scope can the Go garbage collector reclaim its storage (assuming it's the last thing pointing to those data). You might check whether that could be happening in your application.
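
As a hedged sketch of that anti-pattern (the struct and field names are invented for illustration):

package main

import "github.com/newrelic/go-agent/v3/newrelic"

// Anti-pattern: a long-lived value keeps a pointer to the Transaction, so even
// after End() the data buffered on it (including log events) stays reachable
// and cannot be garbage collected.
type worker struct {
    lastTxn *newrelic.Transaction // invented field; holding this pins the transaction's data
}

func (w *worker) handle(app *newrelic.Application) {
    txn := app.StartTransaction("worker.handle") // hypothetical transaction name
    w.lastTxn = txn // keeps the transaction (and everything it collected) alive indefinitely
    // ... work ...
    txn.End()
}

// Preferred: keep the Transaction as a local variable only, so it goes out of
// scope (and its storage becomes collectable) once End() has been called.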
