simulator: record events on prod cluster and replay them on a fake cluster any time #395
Comments
+1, thanks for the proposal!
No need. Please just go ahead :)
/area simulator
Thanks :) When replaying the events, the logic for applying resources will be the same as in the syncer, so I'm going to make that logic reusable by moving it into a new package. The oneshotimporter also has code for applying resources, though. So how about modularizing the code in oneshotimporter and syncer before implementing this feature?
sg. I think the resource-applying logic can live in its own package, and oneshot/syncer/replayer can all use it.
Okay, I'll do it first and make a PR.
/kind feature
This issue proposes a new feature to record events on a prod cluster and replay them on a fake cluster at any time.
Background
Debugging customized schedulers is hard. One reason is that some bugs only occur on the prod cluster. Such bugs are hard to reproduce on a fake cluster because the fake cluster does not have the same load as the prod cluster.
We have the syncer feature, which makes it easy to simulate a real load on a fake cluster. However, it would be more helpful to save the series of events on the prod cluster that causes the issue and replay them on a fake cluster at any time, especially for debugging.
Goals
Users can run a process that watches events on the prod cluster and saves them in some way (e.g. a JSON file). Then they can run the simulator and replay the events on a fake cluster.
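To make the goal concrete, here is a minimal sketch of the record-then-replay flow. It is only an illustration under stated assumptions, not the simulator's design: the `RecordedEvent` shape, the one-JSON-object-per-line file layout, and the `record`/`load`/`replay` helpers are all hypothetical names, and a plain dict stands in for the fake cluster. Only the `ADDED`/`MODIFIED`/`DELETED` event types mirror what a Kubernetes watch reports.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator, TextIO


@dataclass
class RecordedEvent:
    # Hypothetical record shape; the proposal does not define one.
    time: float     # seconds since the recording started
    type: str       # "ADDED", "MODIFIED" or "DELETED"
    kind: str       # e.g. "Pod" or "Node"
    name: str
    resource: dict  # the object as observed on the prod cluster


def record(events: Iterable[RecordedEvent], out: TextIO) -> None:
    """Write each observed event to `out` as one JSON object per line."""
    for ev in events:
        out.write(json.dumps(asdict(ev)) + "\n")


def load(src: TextIO) -> Iterator[RecordedEvent]:
    """Read events back from a file written by `record`."""
    for line in src:
        if line.strip():
            yield RecordedEvent(**json.loads(line))


def replay(events: Iterable[RecordedEvent], cluster: dict) -> None:
    """Apply events in time order to a dict standing in for the fake cluster."""
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.kind, ev.name)
        if ev.type == "DELETED":
            cluster.pop(key, None)
        elif ev.type in ("ADDED", "MODIFIED"):
            cluster[key] = ev.resource
        else:
            raise ValueError(f"unknown event type: {ev.type}")
```

In a real implementation, `record` would be fed by a watch on the prod cluster and `replay` would apply resources to the simulator's API server, presumably through the same applying logic the syncer uses. Keeping one event per line means a recording can be appended to while the watch is running and truncated to the interesting window afterwards.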
User Stories
Story 1
An organization has implemented its own scheduler plugins. The plugins cause an issue only on the prod cluster. The organization wants to reproduce the issue on a fake cluster.
Solution
They can record the events that cause the issue on the prod cluster. Then they can replay the events on a fake cluster to reproduce the issue.
Story 2
They have implemented a new plugin. They want to test and evaluate it with a real load before deploying it to the prod cluster.
Solution
They can record the events on the prod cluster and save them. When they implement a new plugin, they can use the recorded events to test and evaluate the plugin.
Note
This might be a fairly large feature, so please let me know if we need a KEP.