fix: fix ethtool issue when trying to skip unsupported interface #1296

Open
wants to merge 5 commits into base: main

Conversation

xiaozhiche320
Contributor

@xiaozhiche320 xiaozhiche320 commented Jan 31, 2025

Description

This PR addresses an issue where the plugin repeatedly logged the same "operation not supported" error for interfaces on every metrics collection cycle. The root cause was that the ethtool handle was created and closed during each interval, which invalidated any cached (LRU) information about unsupported interfaces. Now, once an interface is identified as unsupported, the plugin no longer logs the same error for it on later cycles.

Related Issue


#1280

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed


Logs before the fix:
[screenshot: Before]
Logs after the fix. Each unsupported interface is now logged only once. (A debug message was added temporarily to test and validate the behaviour of the LRU; it has been removed from the PR.)
[screenshot: After]


@xiaozhiche320 xiaozhiche320 requested a review from a team as a code owner January 31, 2025 14:13
pkg/plugin/linuxutil/linuxutil_linux.go (outdated, resolved)
pkg/plugin/linuxutil/ethtool_stats_linux.go (outdated, resolved)
pkg/plugin/linuxutil/ethtool_handle_linux.go (outdated, resolved)
@xiaozhiche320 xiaozhiche320 self-assigned this Jan 31, 2025
Member

@SRodi SRodi left a comment

LGTM, pending @ritwikranjan re-review.

Comment on lines +72 to +83
ethHandle, err := ethtool.NewEthtool()
if err != nil {
	lu.l.Error("Error while creating ethHandle: %v\n", zap.Error(err))
	return fmt.Errorf("failed to create ethHandle: %w", err)
}
defer ethHandle.Close()

ethReader := NewEthtoolReader(ethtoolOpts, ethHandle)
if ethReader == nil {
	lu.l.Error("Error while creating ethReader")
	return errors.New("error while creating ethReader")
}
Contributor

One of the reasons for keeping the ethHandle inside the loop is to make sure we are not bringing down the Retina agent for a transient error. Do you think the ethHandle error is permanent, as in, if it fails once, it cannot succeed on retry?
If you return an error here, PluginManager will treat it as a fatal error.

Contributor

@anubhabMajumdar anubhabMajumdar Feb 4, 2025

Bringing this out is probably OK, but I would retry a few times (with exponential backoff) before reporting failure.

The ethHandle opens a Unix socket and uses it to get stats about interfaces - https://github.com/safchain/ethtool/blob/c20939d9864b2df0943b7c0a2363f8cab34072c7/ethtool.go#L1029C13-L1029C24 .

While recreating it every time incurs some overhead, keeping a long-running socket open can potentially consume resources (we are holding onto a socket on the host for a long time). Also, long-lived sockets might encounter issues like network changes or timeouts, requiring robust error handling. You would have to add code to make sure the connection is alive; otherwise we will keep seeing errors and no metrics, and it would require user intervention to restart the agent container.

Contributor

I think your cache shouldn't be tied to the lifecycle of the tool; it should persist for the agent's lifecycle.
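
One way to read this suggestion, sketched under stated assumptions: the cache of unsupported interfaces is created once when the plugin starts and handed to each short-lived ethtool reader, so closing the handle no longer discards it. The type `unsupportedCache` and its methods are hypothetical, and a mutex-guarded set is swapped in here for the LRU the PR uses, to keep the example standard-library only.

```go
package main

import "sync"

// unsupportedCache is a hypothetical, agent-lifetime record of interfaces
// whose stats query failed with "operation not supported". The PR itself
// uses an LRU; a plain set is used here for brevity.
type unsupportedCache struct {
	mu    sync.Mutex
	names map[string]struct{}
}

func newUnsupportedCache() *unsupportedCache {
	return &unsupportedCache{names: make(map[string]struct{})}
}

// MarkUnsupported records name and reports whether this is the first time
// it was recorded, so the caller can log the error exactly once.
func (c *unsupportedCache) MarkUnsupported(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, seen := c.names[name]; seen {
		return false
	}
	c.names[name] = struct{}{}
	return true
}

// IsUnsupported reports whether name was previously marked, letting the
// reader skip the interface without querying it again.
func (c *unsupportedCache) IsUnsupported(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, seen := c.names[name]
	return seen
}
```

Because the cache lives outside the handle, the handle can still be created and closed on every collection interval without bringing back the repeated log lines.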

@@ -58,8 +58,6 @@ func (er *EthtoolReader) readInterfaceStats() error {
return err
}

defer er.ethHandle.Close()
Contributor

I don't like closing this here; thanks for refactoring it.

4 participants