Scheduler remove lock contention #179

Draft
wants to merge 3 commits into
base: main

@musitdev musitdev commented Apr 5, 2024

I've added a check for unexecuted Txs at startup. Any Tx found unexecuted is now executed.

I've added zombie VM detection. It uses client.is_alive(), like the current scheduler.

In my tests, zombie VMs always return true for this call and keep running indefinitely, using all the CPU. The VM logs show that the kernel has crashed, but the VM still appears to be running.
They are stopped once MAX_VM_RUN_TIME elapses, but that takes time.

So the current implementation halts a crashed VM only after MAX_VM_RUN_TIME has elapsed. I didn't manage to produce a VM that crashes and is reported as not alive.
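
To make that behaviour concrete, here is a minimal sketch of the halt decision as described above. It is not the code from this PR: `check_vm`, `VmVerdict` and the 600-second value of `MAX_VM_RUN_TIME` are assumptions for illustration, and `is_alive` stands in for the result of client.is_alive().

```rust
use std::time::{Duration, Instant};

/// Assumed value; the PR names MAX_VM_RUN_TIME but does not state its value.
const MAX_VM_RUN_TIME: Duration = Duration::from_secs(600);

/// Hypothetical outcome of one health check on a running VM.
#[derive(Debug, PartialEq, Eq)]
enum VmVerdict {
    /// Reported alive and still within its run-time budget.
    Healthy,
    /// The liveness call returned false.
    Dead,
    /// Reported alive, but the run-time budget is used up.
    TimedOut,
}

/// `is_alive` stands in for the result of client.is_alive().
fn check_vm(is_alive: bool, started_at: Instant, now: Instant) -> VmVerdict {
    if !is_alive {
        VmVerdict::Dead
    } else if now.duration_since(started_at) >= MAX_VM_RUN_TIME {
        VmVerdict::TimedOut
    } else {
        VmVerdict::Healthy
    }
}
```

Because a zombie VM keeps answering true, it can only ever reach the `TimedOut` branch, which is why it keeps the CPU busy until the limit is hit.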

I think we can implement two things to get better zombie detection:

  • Detect that the VM hasn't called get_task() after a certain time. I haven't tested this, but the VMs seem to crash early.
  • Add activity detection using the file system. In the shim SDK, we can create a task that watches a file and recreates it if it is missing. This way, by removing the file, we can check whether it gets recreated.
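
The first idea could be sketched roughly as follows. This is only an illustration under assumptions: `TaskCallTracker`, its methods, the `u64` VM id and the 30-second `GET_TASK_DEADLINE` are hypothetical names and values, not part of the existing scheduler.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Assumed grace period for a VM to make its first get_task() call.
const GET_TASK_DEADLINE: Duration = Duration::from_secs(30);

/// Hypothetical tracker: per VM, when it started and when it last called get_task().
#[derive(Default)]
struct TaskCallTracker {
    // Key: hypothetical VM id. Value: (start time, last get_task() time, if any).
    vms: HashMap<u64, (Instant, Option<Instant>)>,
}

impl TaskCallTracker {
    /// Record that a VM has been started.
    fn vm_started(&mut self, vm_id: u64, now: Instant) {
        self.vms.insert(vm_id, (now, None));
    }

    /// Record a get_task() call; calls from unknown VMs are ignored.
    fn get_task_called(&mut self, vm_id: u64, now: Instant) {
        if let Some(entry) = self.vms.get_mut(&vm_id) {
            entry.1 = Some(now);
        }
    }

    /// VMs that never called get_task() and are past the deadline, sorted by id.
    fn suspected_zombies(&self, now: Instant) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .vms
            .iter()
            .filter(|(_, entry)| {
                entry.1.is_none() && now.duration_since(entry.0) >= GET_TASK_DEADLINE
            })
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

The scheduler loop would call `suspected_zombies` periodically and stop the returned VMs without waiting for MAX_VM_RUN_TIME. The deadline would need tuning so that slow-booting VMs on a loaded node are not killed by mistake.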

In my tests, the number of zombie VMs depends on the node's CPU usage.

It's a raw implementation for testing. I think we can separate the task scheduling part from the VM management part to simplify both the loop and the VM (start/stop/zombie) management.

@musitdev musitdev requested a review from tuommaki April 5, 2024 16:39