fix: fix ethtool issue when trying to skip unsupported interface #1296
LGTM, pending @ritwikranjan re-review.
ethHandle, err := ethtool.NewEthtool()
if err != nil {
	lu.l.Error("Error while creating ethHandle: %v\n", zap.Error(err))
	return fmt.Errorf("failed to create ethHandle: %w", err)
}
defer ethHandle.Close()

ethReader := NewEthtoolReader(ethtoolOpts, ethHandle)
if ethReader == nil {
	lu.l.Error("Error while creating ethReader")
	return errors.New("error while creating ethReader")
}
One of the reasons for keeping the ethHandle inside the loop is to make sure we do not bring down the Retina agent because of a transient error. Do you think the ethHandle error is permanent, as in if it fails once, it cannot succeed on retry?
If you return an error here, PluginManager will treat it as a fatal error.
Bringing this out is probably OK, but I would retry a few times (with exponential backoff) before reporting failure.
The ethHandle opens a Unix socket and uses it to get stats about interfaces: https://github.com/safchain/ethtool/blob/c20939d9864b2df0943b7c0a2363f8cab34072c7/ethtool.go#L1029C13-L1029C24 .
While recreating it every time incurs some overhead, keeping a long-running socket open can consume resources (we are holding onto a socket on the host for a long time). Also, long-lived sockets might encounter issues like network changes or timeouts, requiring robust error handling. You have to add code to make sure the connection is alive; otherwise we will keep seeing errors and no metrics, and the user would have to intervene and restart the agent container.
I think your cache shouldn't be tied to the lifecycle of the tool; it should persist for the agent's lifecycle.
@@ -58,8 +58,6 @@ func (er *EthtoolReader) readInterfaceStats() error {
		return err
	}

	defer er.ethHandle.Close()
I don't like closing this here, thanks for refactoring this.
Description
This PR addresses an issue where the plugin repeatedly logged the same "operation not supported" error for interfaces on every metrics collection cycle. The root cause was that the ethtool handle was created and closed during each interval, which invalidated the cached (LRU) record of unsupported interfaces. Now, once an interface is identified as unsupported, the same error is not logged again.
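The behaviour described above can be sketched as follows. This is an illustration only, not the PR's code: it uses a plain map instead of an LRU cache, and `collector`, `errNotSupported`, and the `stats` callback are hypothetical names standing in for the reader and the ethtool call.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotSupported is a hypothetical sentinel for the
// "operation not supported" error returned for some interfaces.
var errNotSupported = errors.New("operation not supported")

// collector outlives individual collection cycles, so its record of
// unsupported interfaces is kept between cycles.
type collector struct {
	unsupported map[string]struct{}
	stats       func(iface string) (map[string]uint64, error)
	logged      []string // stands in for emitted log lines
}

// collect reads stats for each interface, skipping any interface already
// known to be unsupported and recording newly discovered ones once.
func (c *collector) collect(ifaces []string) {
	for _, name := range ifaces {
		if _, skip := c.unsupported[name]; skip {
			continue
		}
		if _, err := c.stats(name); errors.Is(err, errNotSupported) {
			c.unsupported[name] = struct{}{}
			c.logged = append(c.logged, name)
		}
	}
}

func main() {
	c := &collector{
		unsupported: map[string]struct{}{},
		stats: func(iface string) (map[string]uint64, error) {
			if iface == "lo" {
				return nil, errNotSupported
			}
			return map[string]uint64{"rx_packets": 1}, nil
		},
	}
	for cycle := 0; cycle < 3; cycle++ {
		c.collect([]string{"eth0", "lo"})
	}
	fmt.Println("interfaces logged as unsupported:", c.logged)
}
```

If the map were recreated on every cycle, as happened when the handle and its reader were rebuilt each interval, the unsupported interface would be queried and logged again every time.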
Related Issue
#1280
Checklist
git commit -S -s ...). See this documentation on signing commits.
Screenshots (if applicable) or Testing Completed
Logs before fixing the issue:
![Before](https://private-user-images.githubusercontent.com/76769345/408593347-edae5f1e-275e-4de6-9f74-23100dc49600.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5NDkyNDksIm5iZiI6MTczODk0ODk0OSwicGF0aCI6Ii83Njc2OTM0NS80MDg1OTMzNDctZWRhZTVmMWUtMjc1ZS00ZGU2LTlmNzQtMjMxMDBkYzQ5NjAwLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDE3MjIyOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTFjNzQzYWQ2YmZkYWZiOTY5MGE2MjUzOWNhOTQ5YmUwODE3MTdmMGMxODYyMjNjNGE1YzYzMjQ4NzMxMDMxYmEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.LbouCWYyQt-l3CwAujQY6Ita-_a4oyqpnpBg-pUJR2g)
![After](https://private-user-images.githubusercontent.com/76769345/408593994-ffbff4da-431b-4349-bc67-bdb4d7818fb9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5NDkyNDksIm5iZiI6MTczODk0ODk0OSwicGF0aCI6Ii83Njc2OTM0NS80MDg1OTM5OTQtZmZiZmY0ZGEtNDMxYi00MzQ5LWJjNjctYmRiNGQ3ODE4ZmI5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDE3MjIyOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTRmNGE1N2JmYTRmZWFlMmM1NDBhNTJiNGFiN2E1YjEyNTFmNGQxYmYyZjljODBkMzMxMWQwNTM1NWZiNTI5MjImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.gAIeXZtR8fAvj-OXk8dAoyLwSeHZ7mWiFthizv03maw)
The second screenshot shows the logs after the fix: each unsupported interface is now logged only once. A debug message was added temporarily to test and validate the behaviour of the LRU cache; it has since been removed from the PR.