simulator: record events on prod cluster and replay them on a fake cluster any time #395

saza-ku · 2024-11-25T08:49:45Z

/kind feature

This issue proposes a new feature to record events on a prod cluster and replay them on a fake cluster any time.

Background

Debugging customized schedulers is a complex challenge. One reason is that some bugs occur only on the prod cluster. Such an issue is hard to reproduce on a fake cluster because the fake cluster does not have the same load as the prod cluster.

We have the syncer feature, which makes it easy to simulate a real load on a fake cluster. However, it would be more helpful to save the series of events on the prod cluster that causes the issue and replay it on a fake cluster at any time, especially for debugging.

Goals

Users can run a process that watches events on the prod cluster and saves them in some way (e.g. a JSON file). Then they can run the simulator and replay the events on a fake cluster.

User Stories

Story 1

An organization has implemented their own scheduler plugins. The plugins cause an issue only on the prod cluster. They want to reproduce the issue on a fake cluster.

Solution

They can record the events that cause the issue on the prod cluster. Then they can replay the events on a fake cluster to reproduce the issue.

Story 2

They have implemented a new plugin. They want to test and evaluate it with a real load before deploying it to the prod cluster.

Solution

They can record the events on the prod cluster and save them. When they implement a new plugin, they can use the recorded events to test and evaluate the plugin.

Note

This might be a fairly large feature, so please let me know if we need a KEP.

sanposhiho · 2024-11-26T01:20:46Z

+1, thanks for the proposal!

> This might be a fairly large feature, so please let me know if we need a KEP.

No need. Please just go ahead :)

sanposhiho · 2024-11-26T01:20:57Z

/area simulator

saza-ku · 2024-12-05T07:16:53Z

Thanks:)

When replaying the events, the logic for applying resources will be the same as that of syncer. So I'm going to make the logic reusable by moving it into a new package.

But oneshotimporter also has code for applying resources. So how about modularizing the code in oneshotimporter and syncer before implementing this feature?

sanposhiho · 2024-12-05T09:28:22Z

Sounds good. I think the resource-applying logic can live in its own package, and oneshot/syncer/replayer can all use it.
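As a rough illustration of that layering, the sketch below puts the applying logic behind one interface and lets a one-shot import and a replay both call it. Everything here is hypothetical: the `Applier` interface, the `memoryApplier` toy implementation, and the `importSnapshot`/`replay` helpers are made up for this example and are not the API of any package in the simulator. A syncer would call the same `applyOp` from its watch loop.

```go
package main

import (
	"errors"
	"fmt"
)

// Applier is a hypothetical interface for the shared applying logic.
type Applier interface {
	Create(resource, name string) error
	Update(resource, name string) error
	Delete(resource, name string) error
}

// memoryApplier is a toy in-memory Applier standing in for a fake cluster.
// The map value counts how many times an object has been created or updated.
type memoryApplier struct {
	objects map[string]int
}

func newMemoryApplier() *memoryApplier {
	return &memoryApplier{objects: map[string]int{}}
}

func key(resource, name string) string { return resource + "/" + name }

func (m *memoryApplier) Create(resource, name string) error {
	k := key(resource, name)
	if _, ok := m.objects[k]; ok {
		return fmt.Errorf("%s already exists", k)
	}
	m.objects[k] = 1
	return nil
}

func (m *memoryApplier) Update(resource, name string) error {
	k := key(resource, name)
	if _, ok := m.objects[k]; !ok {
		return fmt.Errorf("%s not found", k)
	}
	m.objects[k]++
	return nil
}

func (m *memoryApplier) Delete(resource, name string) error {
	k := key(resource, name)
	if _, ok := m.objects[k]; !ok {
		return fmt.Errorf("%s not found", k)
	}
	delete(m.objects, k)
	return nil
}

// Op is one operation to apply; Verb is "create", "update", or "delete".
type Op struct {
	Verb, Resource, Name string
}

// applyOp is the single place where callers turn an Op into an Applier call.
func applyOp(a Applier, op Op) error {
	switch op.Verb {
	case "create":
		return a.Create(op.Resource, op.Name)
	case "update":
		return a.Update(op.Resource, op.Name)
	case "delete":
		return a.Delete(op.Resource, op.Name)
	default:
		return errors.New("unknown verb " + op.Verb)
	}
}

// importSnapshot is the one-shot caller: every object in the snapshot is created.
func importSnapshot(a Applier, resource string, names []string) error {
	for _, n := range names {
		if err := applyOp(a, Op{Verb: "create", Resource: resource, Name: n}); err != nil {
			return err
		}
	}
	return nil
}

// replay is the replayer caller: recorded operations are applied in order.
func replay(a Applier, ops []Op) error {
	for _, op := range ops {
		if err := applyOp(a, op); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	cluster := newMemoryApplier()
	if err := importSnapshot(cluster, "nodes", []string{"node-1"}); err != nil {
		fmt.Println("import failed:", err)
	}
	err := replay(cluster, []Op{
		{Verb: "create", Resource: "pods", Name: "pod-1"},
		{Verb: "update", Resource: "pods", Name: "pod-1"},
		{Verb: "delete", Resource: "pods", Name: "pod-1"},
	})
	if err != nil {
		fmt.Println("replay failed:", err)
	}
	fmt.Println("objects left on the fake cluster:", len(cluster.objects))
}
```

With this shape, the callers differ only in where their operations come from (a snapshot, a live watch, or a recorded file), so fixes to the applying logic land in one place.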

saza-ku · 2024-12-05T11:35:33Z

Okay, I'll do it first and make a PR.

saza-ku · 2024-12-16T07:13:01Z

#376 is going to change oneshotimporter, so I first separated resourceapplier from syncer (#400).

Next I'll implement the replaying feature using it before fixing oneshotimporter.

k8s-ci-robot added the kind/feature label Nov 25, 2024

k8s-ci-robot added the area/simulator label Nov 26, 2024

saza-ku mentioned this issue Dec 16, 2024 in "Separate the logic of applying resources from syncer and make it reusable" #400 (merged)
