Why another scheduler? What is the purpose of this project? #26

Open
Cdayz opened this issue Mar 3, 2024 · 3 comments

Cdayz commented Mar 3, 2024

Yeah, my question is pretty simple: why did you start building a new k8s scheduler?

There are many different production-grade solutions, such as Volcano and YuniKorn.

They offer the same functionality and many more features that are already built in.

What is the purpose of this project?

NickrenREN (Collaborator) commented Mar 5, 2024

@Cdayz The main goal of this project is to provide a unified scheduler for online and offline workloads, making it easier to colocate them and to improve resource utilization and elasticity.

Volcano is an offline scheduler, and YuniKorn only provides a Kubernetes adaptor.

At ByteDance, the cluster scale is very large (20k nodes, 1,000k pods in a single cluster) and the business scenarios are complex, so it is difficult to use an existing scheduler directly, and the development effort of building on top of one is not acceptable.

Cdayz (Author) commented Mar 7, 2024

@NickrenREN, can you please describe the difference between online and offline workloads? For example, what do you mean by those terms?

I am asking because, according to my understanding, an online scheduler is a scheduler that does not know when a new task arrives or when an already running task will finish. Volcano can be used in such an environment, as far as I know.

Maybe I am wrong; if so, I apologize for wasting your time. However, I think it is important to clarify the goals of this project, and possibly to write a decision record with the pros and cons of other solutions, along with some clarification as to why this one is necessary.

NickrenREN (Collaborator) commented
@Cdayz Generally speaking, online workloads are SLA-bound, latency-sensitive workloads, such as microservices and RPC services, while offline workloads are mostly throughput-oriented and care more about job completion time, such as Hadoop batch apps and ML training tasks.

They care about different metrics, so their scheduling requirements differ. For example, Hadoop batch apps need high scheduling throughput (1k pods per second in our prod env), while ML training tasks need gang scheduling and job-level affinity (all tasks in one job must be scheduled into one ToR or one network segment; any ToR or network segment is fine, but the job cannot cross them), plus some other complex features.
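For readers unfamiliar with gang scheduling: the idea is that a job's pods are placed all-or-nothing, so a half-scheduled ML training job never sits deadlocked waiting for its peers. As a rough illustration only (this is modeled on the upstream kubernetes-sigs/scheduler-plugins coscheduling plugin's PodGroup CRD, not on this project's own API; the exact API group, fields, and label key may differ by version):

```yaml
# Sketch of gang scheduling via a PodGroup-style CRD.
# Names and the image below are placeholders for illustration.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: ml-training-job
spec:
  minMember: 8              # bind only if all 8 workers can be placed together
  scheduleTimeoutSeconds: 60
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  labels:
    # Associates the pod with its gang; the scheduler holds all members
    # until minMember of them can be bound at once.
    pod-group.scheduling.sigs.k8s.io: ml-training-job
spec:
  containers:
  - name: trainer
    image: training-image:latest   # placeholder
    resources:
      requests:
        nvidia.com/gpu: 1
```

The ToR / network-segment affinity described above would additionally constrain where the whole gang lands, conceptually like a job-level pod affinity keyed on a rack or segment topology label.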
