Why another scheduler? What is the purpose of this project? #26

Open
Cdayz opened this issue Mar 3, 2024 · 3 comments

Cdayz commented Mar 3, 2024

Yeah, my question is pretty simple: why did you start building a new k8s scheduler?

There are many different production-grade solutions, such as Volcano and YuniKorn.

They offer the same functionality and many more features that are already built in.

What is the purpose of this project?

NickrenREN (Collaborator) commented Mar 5, 2024

@Cdayz The main goal of this project is to provide a unified scheduler for online and offline workloads, making it easier to colocate them and to improve resource utilization and elasticity.

Volcano is an offline scheduler, and YuniKorn only provides a Kubernetes adaptor.

At ByteDance, the cluster scale is very large (20k nodes, 1,000k pods in a single cluster) and the business scenarios are complex, so it is difficult to use an existing scheduler directly, and the development effort of building on top of one is not acceptable.

Cdayz (Author) commented Mar 7, 2024

@NickrenREN, can you please describe the difference between online and offline workloads? For example, what do you mean by those terms?

I am asking because, according to my understanding, an online scheduler is a scheduler that does not know when a new task arrives or when an already running task will finish. Volcano can be used in such an environment, as far as I know.

Maybe I am wrong; if so, I apologize for wasting your time. However, I think it is important to clarify the goals of this project, and possibly to write a decision record with the pros and cons of other solutions, along with some clarification as to why this one is necessary.

NickrenREN (Collaborator) commented
@Cdayz Generally speaking, online workloads are SLA-bound, latency-sensitive workloads, such as microservices and RPC services, while offline workloads are mostly throughput-oriented and care more about job completion time, such as Hadoop batch apps and ML training tasks.

They care about different metrics, so their scheduling requirements differ. For example, Hadoop batch apps need high scheduling throughput (1k pods per second in our prod env), while ML training tasks need gang scheduling and job-level affinity (all tasks in one job must be scheduled into one ToR or one network segment; any ToR or network segment is fine, but the job cannot cross them), plus some other complex features.
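For readers unfamiliar with gang scheduling: the idea is that a job's pods are placed all-or-nothing, so a half-scheduled ML training job never sits deadlocked waiting for its peers. As a rough illustration only (this is modeled on the upstream kubernetes-sigs/scheduler-plugins coscheduling plugin's PodGroup CRD, not on this project's own API; the exact API group, fields, and label key may differ by version):

```yaml
# Sketch of gang scheduling via a PodGroup-style CRD.
# Names and the image below are placeholders for illustration.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: ml-training-job
spec:
  minMember: 8              # bind only if all 8 workers can be placed together
  scheduleTimeoutSeconds: 60
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  labels:
    # Associates the pod with its gang; the scheduler holds all members
    # until minMember of them can be bound at once.
    pod-group.scheduling.sigs.k8s.io: ml-training-job
spec:
  containers:
  - name: trainer
    image: training-image:latest   # placeholder
    resources:
      requests:
        nvidia.com/gpu: 1
```

The ToR / network-segment affinity described above would additionally constrain where the whole gang lands, conceptually like a job-level pod affinity keyed on a rack or segment topology label.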
