Initial draft of KEP 6089: Workload Aware Scheduling Controller APIs #6092
mm4tt wants to merge 5 commits into
Conversation
mm4tt
commented
May 18, 2026
- One-line PR description: Initial draft of KEP 6089: Workload Aware Scheduling Controller APIs
- Issue link: WAS: Controller Integration APIs #6089
- Other comments:
wojtek-t
left a comment
Great KEP - I added a few questions & concerns that I still have.
| ## Summary | ||
| This KEP proposes a standardized set of reusable API building blocks (`scheduling.k8s.io`), integration guidelines, and shared libraries to simplify how workload controllers (e.g., JobSet, TrainJob, RayJob, LWS, as well as core workloads like Job) integrate with Workload-Aware Scheduling (WAS). |
There was a problem hiding this comment.
Can you please keep the lines shorter (it helps with readability and having more targeted comments/adjustments).
| * **The Controller owns the Structure:** The real-workload controller (e.g., `JobSet` or `LWS`) fully understands its own shape, hierarchy, and replication mechanics. It does not need the user to manually repeat this structure to the scheduler. | ||
| * **The User owns the Policy:** The user knows *how* they want the workload to be scheduled based on their specific environment (e.g., "I want gang scheduling", "I need these workers colocated on the same network rack"). | ||
| * **The Controller acts as a Translator:** The real-workload controller consumes the user's high-level policy intent, combines it with its own structural knowledge, and acts as a compiler to generate the low-level `Workload` objects for the scheduler. | ||
| * **Universal Representation:** Legacy, standard pod-by-pod scheduling is represented natively as a first-class citizen (`Basic` mode). Controllers always generate the underlying `Workload` objects, using basic scheduling as the backward-compatible default for standalone Jobs. |
There was a problem hiding this comment.
nit: s/standalone jobs/true workloads/ ?
| We introduce a set of standard, reusable structs in the `scheduling.k8s.io` API group. These building blocks represent the core capabilities of Workload-Aware Scheduling. They are designed to be embedded directly into higher-level, controller-specific wrapper API structs. | ||
| Importantly, these structures are **hierarchy-agnostic**. The same primitives are used regardless of where they are embedded in a workload tree - whether at the `PodGroup` or `CompositePodGroup` level. TODO(mm4tt@): We should likely abandon this approach and have primitives per PodGroup/CompositePodGroup level. We needed agnostic hierarchy when we thought we'd be modeling multi-level tree with PodSubGroup. In that case, depending on the context Job could be modeled either as PodGroup (standalone Job) or PodSubGroup (member of JobSet). With CompositePodGroup the problem disappears and Job will be always modeled as PodGroup. |
There was a problem hiding this comment.
I'm not sure I fully agree with this reasoning - for me the change from PodGroup:PodSubGroup to CompositePodGroup:PodGroup doesn't really change anything in this respect.
I think what really matters is the concepts that you want to express. Taking TAS as an example - if I want to reflect that a certain portion of my workload needs to be colocated - it doesn't really matter if it will be represented with PG or CPG - the intention is exactly the same and I shouldn't need to learn two different concepts to reflect that.
I can imagine some concepts where the difference will matter - but it should be a case-by-case decision. The default should be the "hierarchy-agnostic" approach, and we should split the types only where we have good arguments that the concept means different things at each level.
There was a problem hiding this comment.
I addressed this TODO, now we have per-level types as discussed.
| } | ||
| // WorkloadTopologyConstraint describes a desired topological colocation for all pods in the group. | ||
| type WorkloadTopologyConstraint struct { |
There was a problem hiding this comment.
I understand why we want to have a separate WorkloadSchedulingConstraints struct.
But arguably, the TopologyConstraint should actually be shared (we shouldn't introduce WorkloadTopologyConstraint here) - because that's a concept that reflects the topology (and thus by definition can't diverge from what you reflect in PodGroup).
There was a problem hiding this comment.
I see two potential approaches here:
- We just re-use the existing structs from scheduling/v1alpha3.
- We introduce an entirely new API group with all the shareable APIs, and then the library will be responsible for translating that into the core scheduling primitives.
Duplicating everything under the scheduling group is not the right approach, just like Wojtek says. Maybe worth double-checking with API approvers?
There was a problem hiding this comment.
Agreed. TopologyConstraint is now shared.
> We introduce entirely new API group with all the share-able APIs, and then the library will be responsible for translating that into the core scheduling primitives.

I discussed this with @liggitt and his recommendation was to have the building blocks under the scheduling API group.
| // +optional | ||
| // +k8s:optional | ||
| // +k8s:unionMember | ||
| Single *WorkloadSingleDisruptionMode `json:"single,omitempty" protobuf:"bytes,1,opt,name=single"` |
| During updates to an active `Job`, the API server validation strategy enforces the following rules: | ||
| * **Immutable Fields (Updates are explicitly rejected):** | ||
| * Modifying the scheduling mode itself (e.g. changing an active Job from `Basic` to `Gang` scheduling or vice-versa). | ||
| * Updating `SchedulingConstraints` (topology co-location rules) or `DisruptionMode` after the Job has started. |
There was a problem hiding this comment.
If we forbid them only after the job is started, you have an inherent race that the job-controller may have already started it (just didn't yet report it).
Why can't we just forbid any mutation for now?
Or at the very least, say that we forbid it for non-suspended jobs.
There was a problem hiding this comment.
I agree with Wojtek here, let's start simple by forbidding mutation; it's easier to relax that validation at a later stage than to fix problems after the fact.
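The "forbid any mutation for now" option suggested above could look roughly like the sketch below. `SchedulingConfig`, its fields, and `ValidateSchedulingUpdate` are hypothetical stand-ins for illustration, not types from the KEP or from `batch/v1`; the point is only that the check does not depend on whether the Job has started, which sidesteps the race described above.

```go
package main

import (
	"errors"
	"reflect"
)

// SchedulingConfig is a hypothetical stand-in for the scheduling-related
// part of a JobSpec (mode, topology constraint, disruption mode).
type SchedulingConfig struct {
	Mode           string // e.g. "Basic" or "Gang"
	TopologyKey    string
	DisruptionMode string
}

// ErrSchedulingImmutable is returned for any change to the configuration.
var ErrSchedulingImmutable = errors.New("scheduling configuration is immutable")

// ValidateSchedulingUpdate rejects every difference between the old and new
// configuration, regardless of Job state (started, suspended, or not).
func ValidateSchedulingUpdate(oldCfg, newCfg *SchedulingConfig) error {
	if reflect.DeepEqual(oldCfg, newCfg) {
		return nil
	}
	return ErrSchedulingImmutable
}
```

Relaxing this later (for example, allowing changes on suspended Jobs) would only add an exemption in front of the comparison, which is the direction the reviewers argue is easier than tightening validation after release.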
| * Updating `SchedulingConstraints` (topology co-location rules) or `DisruptionMode` after the Job has started. | ||
| * Changing dynamic `ResourceClaims` on running workloads. | ||
| * **Mutable Fields (Updates are allowed):** | ||
| * Modifying the `WorkloadGangSchedulingPolicy.MinCount` parameter. This can be done either explicitly by updating the `MinCount` field inside the JobSpec, or implicitly by scaling `spec.parallelism` (which automatically adjusts the default `MinCount`). |
There was a problem hiding this comment.
A parallelism change would trigger it only if MinCount is unset, right?
There was a problem hiding this comment.
Yes, made it clear in the text.
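A minimal sketch of the defaulting rule confirmed in this exchange: an explicit `MinCount` wins, and `spec.parallelism` is used only as the fallback when `MinCount` is unset. The function name `EffectiveMinCount` and the pointer-based "unset" representation are assumptions for illustration.

```go
package main

// EffectiveMinCount returns the gang MinCount to use for a Job.
// A nil minCount means the user did not set it, so parallelism is the default.
func EffectiveMinCount(minCount *int32, parallelism int32) int32 {
	if minCount != nil {
		return *minCount
	}
	return parallelism
}
```

Under this rule, scaling `parallelism` changes the effective value only in the nil case; a Job with an explicit `MinCount` keeps it across scale operations.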
| 1. **Detection:** The Job controller's reconcile loop detects the change and fetches the existing `Workload` resource from the API server. | ||
| 2. **Tree Reconstruction:** It reconstructs the logical `WorkloadNode` tree, automatically feeding the new `parallelism` count into the `DefaultGangPolicy.MinCount` fallback. | ||
| 3. **API Update:** Since `MinCount` in the low-level scheduler `Workload` API is mutable in v1.37+, the Job controller performs an **API Update** on the existing `Workload` resource. | ||
| 4. **PodGroup Sync:** The Job controller propagates the new `MinCount` to the corresponding runtime `PodGroup` resource to ensure the scheduler immediately schedules the newly scaled pods. |
There was a problem hiding this comment.
While this makes sense for Job, it's not the universal pattern. In many cases we don't want to update all existing PodGroups with the new param when it was updated.
I suggest making it clear that it's more of an exception than recommendation here.
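The distinction raised here can be made explicit in code by treating PodGroup propagation as an opt-in step rather than part of every update. The sketch below uses simplified in-memory types; `Workload`, `PodGroup`, `SyncMinCount`, and the `propagate` flag are hypothetical names, not the KEP's API.

```go
package main

// Workload and PodGroup are simplified, hypothetical stand-ins for the
// scheduler-facing objects; only the field relevant here is modeled.
type Workload struct{ MinCount int32 }
type PodGroup struct{ MinCount int32 }

// SyncMinCount updates the Workload when its MinCount differs from desired.
// Existing PodGroups are touched only when propagate is true, which a Job
// controller might set while other controllers leave it false.
// It reports whether the Workload was changed.
func SyncMinCount(w *Workload, groups []*PodGroup, desired int32, propagate bool) bool {
	if w.MinCount == desired {
		return false
	}
	w.MinCount = desired
	if propagate {
		for _, g := range groups {
			g.MinCount = desired
		}
	}
	return true
}
```

With `propagate` set to false, already-created PodGroups keep the value they were created with and only PodGroups created afterwards would pick up the new one.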
| #### 2. Downward Workload Template Mapping via Well-Known Annotations | ||
| If a composite controller delegates runtime `PodGroup` management to child execution controllers, we must solve a crucial coordination problem: **How does a child controller know exactly which `PodGroupTemplate` inside the parent's compiled `Workload` corresponds to its pods?** |
There was a problem hiding this comment.
I think we need more than that - we need to pass two separate pieces of information:
- [what you describe] Which PodGroupTemplate/CompositePodGroupTemplate should be used by the child workload to create the corresponding PG/CPG.
- [what's missing] What is the parent CompositePodGroup to which the newly created PG/CPG should be connected.
This may be discoverable if we have a 1:1 relation of CPGT<->CPG, but if we create multiple CPGs from a given template, this has to be passed.
The best example for that is probably LWS, where a CPG will be created by the LWS controller per replica (so there can be many of them) and they are all created from the same CPGT.
| ##### The Solution: Downward Mapping Annotation | ||
| To resolve this template linkage, the root and intermediate orchestrators must pass down the template mapping using a well-known metadata annotation on child templates: | ||
| * **Annotation Key:** `scheduling.k8s.io/pod-group-template` | ||
| * **Value:** The specific mapping identifier that links the child object to its corresponding `PodGroupTemplate` or `CompositeGroupTemplate` within the n-level parent `Workload` resource. |
There was a problem hiding this comment.
We already made a decision that names across all PGTs/CPGTs within a workload will be unique.
So we should clearly state that the value should be the name of the PGT/CPGT.
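To make the two pieces of information from this thread concrete, here is a sketch of a parent controller stamping annotations on a child object. The key `scheduling.k8s.io/pod-group-template` comes from the KEP text above and carries the PGT/CPGT name; the second key, `scheduling.k8s.io/parent-composite-pod-group`, and the helper `ChildAnnotations` are hypothetical, since the excerpt does not name the annotation for the parent CPG.

```go
package main

const (
	// PodGroupTemplateAnnotation is the key quoted in the KEP; its value is
	// the (workload-unique) name of the PodGroupTemplate or CPGT.
	PodGroupTemplateAnnotation = "scheduling.k8s.io/pod-group-template"
	// ParentCompositePodGroupAnnotation is a HYPOTHETICAL key used here only
	// to illustrate passing the parent CompositePodGroup instance.
	ParentCompositePodGroupAnnotation = "scheduling.k8s.io/parent-composite-pod-group"
)

// ChildAnnotations returns a copy of existing with the template name set and,
// when parentCPG is non-empty, the parent CompositePodGroup name set as well.
// The input map is never modified.
func ChildAnnotations(existing map[string]string, templateName, parentCPG string) map[string]string {
	out := make(map[string]string, len(existing)+2)
	for k, v := range existing {
		out[k] = v
	}
	out[PodGroupTemplateAnnotation] = templateName
	if parentCPG != "" {
		out[ParentCompositePodGroupAnnotation] = parentCPG
	}
	return out
}
```

In the LWS case described above, every replica would share the same template name while `parentCPG` differs per replica, which is why the template name alone cannot identify the parent.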
| - [KEP-4671: Gang Scheduling using Workload Object](https://kep.k8s.io/4671) | ||
| - [KEP-5710: Workload-aware preemption](https://kep.k8s.io/5710) | ||
| - [KEP-5732: Topology-aware workload scheduling](https://kep.k8s.io/5732) | ||
| - [KEP-6017: CompositePodGroup API](https://kep.k8s.io/6017) |
There was a problem hiding this comment.
| - [KEP-6017: CompositePodGroup API](https://kep.k8s.io/6017) | |
| - [KEP-6012: CompositePodGroup API](https://kep.k8s.io/6012) |
is the actual tracking KEP :wink:
| type JobSchedulingConfiguration struct { | ||
| // SchedulingPolicy defines the gang or basic scheduling rules for this Job. | ||
| // +optional | ||
| SchedulingPolicy *schedulingv1alpha3.WorkloadSchedulingPolicy `json:"schedulingPolicy,omitempty"` |
There was a problem hiding this comment.
If the library provides validation, I'd assume we automatically inherit all the changes in the core by virtue of using it, no? Whether we should adjust is probably a separate question that we'll likely consider on a case-by-case basis; the part I'm more worried about is whether we'll remember to.
| ResourceClaimTemplateName *string `json:"resourceClaimTemplateName,omitempty"` | ||
| } | ||
| ``` | ||
| ### Job Integration (batch/v1) |
Signed-off-by: Heba Elayoty <heelayot@microsoft.com>
|
/area workload-aware |
wojtek-t
left a comment
I have a couple more questions, but overall this looks great now!
| owning-sig: sig-scheduling | ||
| participating-sigs: | ||
| - sig-apps | ||
| status: provisional |
| `replicatedJobs` and their parallelism, whereas a single child `Job` only knows its own pods). | ||
| * **Ownership & Skip Logic:** Child controllers (like standard `Job`) observe their | ||
| `OwnerReference` pointing to a registered parent workload and explicitly **bypass** creating | ||
| any `Workload` objects. This prevents duplicate resource creation and guarantees a single |
There was a problem hiding this comment.
But they still may need to create a PodGroup. I think it's worth mentioning here.
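The skip logic plus the caveat in this comment can be sketched as a small decision function. `OwnerRef`, `Plan`, `PlanFor`, and the `registeredParents` set are hypothetical simplifications (a real controller would inspect `metav1.OwnerReference`), and the sketch assumes the child always creates its own PodGroup, whereas the comment only says it may need to.

```go
package main

// OwnerRef is a simplified, hypothetical view of an owner reference.
type OwnerRef struct {
	Kind       string
	Controller bool // true when this owner is the managing controller
}

// Plan describes which scheduling objects a child controller should create.
type Plan struct {
	CreateWorkload bool
	CreatePodGroup bool
}

// PlanFor skips Workload creation when the object is controlled by a
// registered parent workload kind, but still plans a PodGroup (assumption).
// Objects without such an owner create both.
func PlanFor(owners []OwnerRef, registeredParents map[string]bool) Plan {
	for _, o := range owners {
		if o.Controller && registeredParents[o.Kind] {
			return Plan{CreateWorkload: false, CreatePodGroup: true}
		}
	}
	return Plan{CreateWorkload: true, CreatePodGroup: true}
}
```

A Job owned by a JobSet would therefore leave the `Workload` to its parent while still being responsible for its runtime PodGroup, which keeps a single source of truth without dropping the child's part of the work.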
| source of truth. | ||
| * **Separation of Structure and Policy:** The integration strictly separates real-workload | ||
| structure from scheduling policies: | ||
| * **The Controller owns the Structure:** The real-workload controller (e.g., `JobSet` or `LWS`) |
There was a problem hiding this comment.
It's not the controller that owns the structure - it's the API.
So maybe:
"The true workload API owns the structure"
?
| This level-specific categorization allows independent API evolution. | ||
| The only exception to this division is the `TopologyConstraint` struct (reused directly from |
There was a problem hiding this comment.
It's the only exception now, but may not be the only exception eventually.
I don't know how to phrase that concisely, but basically: if something represents a "real-world" concept that is generally used verbatim by the scheduling stack, we should reuse it.
If it's more of an abstraction introduced by us, we should duplicate it, because we may want to use it differently.
| // API Group: scheduling.k8s.io/v1alpha3 | ||
| // WorkloadPodGroupSchedulingConstraints defines leaf-level scheduling constraints, such as topology. | ||
| type WorkloadPodGroupSchedulingConstraints struct { |
There was a problem hiding this comment.
unconstructive comment:
I'm not a fan of those names, because they are becoming super long:
WorkloadCompositePodGroupSchedulingConstraints
is 46 characters :)
But as I said - it's unconstructive - I don't have a better suggestion :)
| To resolve this template and hierarchy mapping without structural API schema changes, the root and | ||
| intermediate orchestrators must propagate these linkages downwards using two well-known metadata | ||
| annotations on the child object templates: |
There was a problem hiding this comment.
What do you mean by "on the child object templates"?
My assumption is that (taking JobSet as an example), JobSet controller will be setting those two annotations on the Job object when creating it.
If that's true - can you clarify in the KEP? If not, can you explain?
| #### 2. Library API Definition | ||
| ```go |
There was a problem hiding this comment.
Where will the library live?
To facilitate imports from arbitrary ecosystem projects, it definitely can't be k/k - it probably has to be some staging repo.
The question is - do we need a new staging repo for it?
The kube-scheduler repo is not a good fit.
The only other alternative that I see is "component-helpers", but I'm not sure it's a good fit either.
@liggitt - thoughts?