Initial draft of KEP 6089: Workload Aware Scheduling Controller APIs by mm4tt · Pull Request #6092 · kubernetes/enhancements

mm4tt · 2026-05-18T17:06:22Z

One-line PR description: Initial draft of KEP 6089: Workload Aware Scheduling Controller APIs

Issue link: WAS: Controller Integration APIs #6089

Other comments:

k8s-ci-robot · 2026-05-18T17:06:26Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2026-05-19T22:25:49Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mm4tt
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign macsko for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

keps/prod-readiness/OWNERS
keps/sig-scheduling/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wojtek-t

Great KEP - I added a few questions & concerns that I still have.

wojtek-t · 2026-05-20T13:53:50Z

+
+## Summary
+
+This KEP proposes a standardized set of reusable API building blocks (`scheduling.k8s.io`), integration guidelines, and shared libraries to simplify how workload controllers (e.g., JobSet, TrainJob, RayJob, LWS, as well as core workloads like Job) integrate with Workload-Aware Scheduling (WAS).


Can you please keep the lines shorter (it helps with readability and having more targeted comments/adjustments).

wojtek-t · 2026-05-20T14:04:21Z

+  * **The Controller owns the Structure:** The real-workload controller (e.g., `JobSet` or `LWS`) fully understands its own shape, hierarchy, and replication mechanics. It does not need the user to manually repeat this structure to the scheduler.
+  * **The User owns the Policy:** The user knows *how* they want the workload to be scheduled based on their specific environment (e.g., "I want gang scheduling", "I need these workers colocated on the same network rack").
+  * **The Controller acts as a Translator:** The real-workload controller consumes the user's high-level policy intent, combines it with its own structural knowledge, and acts as a compiler to generate the low-level `Workload` objects for the scheduler.
+* **Universal Representation:** Legacy, standard pod-by-pod scheduling is represented natively as a first-class citizen (`Basic` mode). Controllers always generate the underlying `Workload` objects, using basic scheduling as the backward-compatible default for standalone Jobs.


nit: s/standalone jobs/true workloads/ ?

wojtek-t · 2026-05-21T12:37:29Z

+
+We introduce a set of standard, reusable structs in the `scheduling.k8s.io` API group. These building blocks represent the core capabilities of Workload-Aware Scheduling. They are designed to be embedded directly into higher-level, controller-specific wrapper API structs.
+
+Importantly, these structures are **hierarchy-agnostic**. The same primitives are used regardless of where they are embedded in a workload tree - whether at the `PodGroup` or `CompositePodGroup` level. TODO(mm4tt@): We should likely abandon this approach and have primitives per PodGroup/CompositePodGroup level. We needed agnostic hierarchy when we thought we'd be modeling multi-level tree with PodSubGroup. In that case, depending on the context Job could be modeled either as PodGroup (standalone Job) or PodSubGroup (member of JobSet). With CompositePodGroup the problem disappears and Job will be always modeled as PodGroup.


I'm not sure I fully agree with this reasoning - for me the change of PodGroup:PodSubGroup vs CompositePodGroup:PodGroup doesn't really change anything wrt it.

I think what really matters is the concepts that you want to express. Taking TAS as an example - if I want to reflect that a certain portion of my workload needs to be collocated - it doesn't really matter if it will be represented with PG or CPG - the intention is exactly the same and I shouldn't need to learn two different concepts to reflect that.

I can imagine some concepts where the difference will matter - but it should be a case-by-case decision. The default approach should be the "hierarchy agnostic approach", and we should split only where we have good arguments that the concept means different things at each level.

I addressed this TODO; we now have per-level types, as discussed.

wojtek-t · 2026-05-21T12:39:35Z

+}
+
+// WorkloadTopologyConstraint describes a desired topological colocation for all pods in the group.
+type WorkloadTopologyConstraint struct {


I understand why we want to have a separate WorkloadSchedulingConstraints struct.

But arguably, the TopologyConstraint should actually be shared (we shouldn't introduce WorkloadTopologyConstraint here) - because that's a concept that reflects the topology (and thus by definition can't diverge from what you reflect in PodGroup).

I see two potential approaches here:

1. We just re-use the existing structs from scheduling/v1alpha3.

2. We introduce an entirely new API group with all the shareable APIs, and then the library will be responsible for translating that into the core scheduling primitives.

Duplicating everything under scheduling group is not the right approach, just like Wojtek says. Maybe worth double checking with API approvers?

Agreed. TopologyConstraint is now shared.

> We introduce an entirely new API group with all the shareable APIs, and then the library will be responsible for translating that into the core scheduling primitives.

I discussed this with @liggitt and his recommendation was to have the building blocks under scheduling API group

wojtek-t · 2026-05-21T12:39:52Z

+	// +optional
+	// +k8s:optional
+	// +k8s:unionMember
+  Single *WorkloadSingleDisruptionMode `json:"single,omitempty" protobuf:"bytes,1,opt,name=single"`


nit: indentation

wojtek-t · 2026-05-21T12:55:04Z

+During updates to an active `Job`, the API server validation strategy enforces the following rules:
+* **Immutable Fields (Updates are explicitly rejected):**
+  * Modifying the scheduling mode itself (e.g. changing an active Job from `Basic` to `Gang` scheduling or vice-versa).
+  * Updating `SchedulingConstraints` (topology co-location rules) or `DisruptionMode` after the Job has started.


If we forbid them only after the job is started, you have an inherent race that the job-controller may have already started it (just didn't yet report it).

Why can't we just forbid any mutation for now?

Or at the very least, say that we forbid it for non-suspended jobs.

I agree with Wojtek here, let's start simple with forbidding mutation; it's easier to relax that validation at a later stage than to fix eventual problems.
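For illustration, the two options raised in this thread (forbid every mutation, or forbid it only for non-suspended Jobs) could be sketched roughly as below. This is a minimal sketch, not the KEP's validation code: `SchedulingConfig`, `JobSpec`, `validateStrict` and `validateSuspendedOnly` are hypothetical names, and `TopologyKey` merely stands in for `SchedulingConstraints`.

```go
package main

import (
	"errors"
	"fmt"
)

// SchedulingConfig is a hypothetical stand-in for the scheduling fields
// discussed in this thread; the real field names live in the KEP.
type SchedulingConfig struct {
	Mode           string // e.g. "Basic" or "Gang"
	TopologyKey    string // stands in for SchedulingConstraints
	DisruptionMode string
}

// JobSpec is a hypothetical, trimmed-down Job spec.
type JobSpec struct {
	Suspend    bool
	Scheduling SchedulingConfig
}

var errImmutable = errors.New("scheduling configuration is immutable")

// validateStrict rejects every change to the scheduling configuration,
// independent of whether the Job has started.
func validateStrict(oldSpec, newSpec JobSpec) error {
	if oldSpec.Scheduling != newSpec.Scheduling {
		return errImmutable
	}
	return nil
}

// validateSuspendedOnly is the relaxed variant: changes are accepted only
// while the stored (old) object is suspended.
func validateSuspendedOnly(oldSpec, newSpec JobSpec) error {
	if oldSpec.Suspend {
		return nil
	}
	return validateStrict(oldSpec, newSpec)
}

func main() {
	running := JobSpec{Scheduling: SchedulingConfig{Mode: "Basic"}}
	changed := running
	changed.Scheduling.Mode = "Gang"

	fmt.Println(validateStrict(running, changed))        // rejected
	fmt.Println(validateSuspendedOnly(running, changed)) // rejected: not suspended

	suspended := running
	suspended.Suspend = true
	fmt.Println(validateSuspendedOnly(suspended, changed)) // accepted
}
```

Neither variant consults a "has started" status, so neither depends on the job-controller having reported that status yet, which is the race described above.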

wojtek-t · 2026-05-21T12:56:07Z

+  * Updating `SchedulingConstraints` (topology co-location rules) or `DisruptionMode` after the Job has started.
+  * Changing dynamic `ResourceClaims` on running workloads.
+* **Mutable Fields (Updates are allowed):**
+  * Modifying the `WorkloadGangSchedulingPolicy.MinCount` parameter. This can be done either explicitly by updating the `MinCount` field inside the JobSpec, or implicitly by scaling `spec.parallelism` (which automatically adjusts the default `MinCount`).


A parallelism change would trigger it only if the MinCount is unset, right?

Yes, made it clear in the text.
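The defaulting rule confirmed here can be shown with a small sketch. The function name `effectiveMinCount` is hypothetical and the pointer-for-optional convention is an assumption; only the rule itself (explicit `MinCount` wins, `parallelism` is the fallback) comes from this thread.

```go
package main

import "fmt"

// effectiveMinCount returns the gang MinCount a controller would use.
// An explicit MinCount always wins; parallelism is only the fallback
// when MinCount is unset (nil).
func effectiveMinCount(minCount *int32, parallelism int32) int32 {
	if minCount != nil {
		return *minCount
	}
	return parallelism
}

func main() {
	explicit := int32(4)
	fmt.Println(effectiveMinCount(nil, 8))        // 8
	fmt.Println(effectiveMinCount(nil, 16))       // 16: scaling changes the default
	fmt.Println(effectiveMinCount(&explicit, 16)) // 4: scaling has no effect
}
```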

wojtek-t · 2026-05-22T08:31:37Z

+1. **Detection:** The Job controller's reconcile loop detects the change and fetches the existing `Workload` resource from the API server.
+2. **Tree Reconstruction:** It reconstructs the logical `WorkloadNode` tree, automatically feeding the new `parallelism` count into the `DefaultGangPolicy.MinCount` fallback.
+3. **API Update:** Since `MinCount` in the low-level scheduler `Workload` API is mutable in v1.37+, the Job controller performs an **API Update** on the existing `Workload` resource.
+4. **PodGroup Sync:** The Job controller propagates the new `MinCount` to the corresponding runtime `PodGroup` resource to ensure the scheduler immediately schedules the newly scaled pods.


While this makes sense for Job, it's not the universal pattern. In many cases we don't want to update all existing PodGroups with the new param when it was updated.

I suggest making it clear that it's more of an exception than recommendation here.

wojtek-t · 2026-05-22T08:45:57Z

+
+
+#### 2. Downward Workload Template Mapping via Well-Known Annotations
+If a composite controller delegates runtime `PodGroup` management to child execution controllers, we must solve a crucial coordination problem: **How does a child controller know exactly which `PodGroupTemplate` inside the parent's compiled `Workload` corresponds to its pods?**


I think we need more than that: we need to pass two separate pieces of information.

1. [what you describe] Which PodGroupTemplate/CompositePodGroupTemplate the child workload should use to create the corresponding PG/CPG.

2. [what's missing] Which parent CompositePodGroup the newly created PG/CPG should be connected to.
   This may be discoverable if we have a 1:1 relation of CPGT<->CPG, but if we create multiple CPGs from a given template, it has to be passed.
   The best example is probably LWS, where the LWS controller creates a CPG per replica (so there can be many of them), all from the same CPGT.

Great point! Added.

wojtek-t · 2026-05-22T08:46:58Z

+##### The Solution: Downward Mapping Annotation
+To resolve this template linkage, the root and intermediate orchestrators must pass down the template mapping using a well-known metadata annotation on child templates:
+* **Annotation Key:** `scheduling.k8s.io/pod-group-template`
+* **Value:** The specific mapping identifier that links the child object to its corresponding `PodGroupTemplate` or `CompositeGroupTemplate` within the n-level parent `Workload` resource.


We already made a decision that names across all PGTs/CPGTs within a workload will be unique.

So we should clearly state that it should be the name of the PGT/CPGT.
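If the value is simply the template name, the child-side lookup becomes a plain search by name. A minimal sketch, assuming a flattened list of templates: the annotation key is the one quoted in the diff above, while `Template` and `resolveTemplate` are hypothetical names and not part of any proposed library.

```go
package main

import "fmt"

// Annotation key quoted in the KEP diff above.
const podGroupTemplateAnnotation = "scheduling.k8s.io/pod-group-template"

// Template is a hypothetical stand-in for a PodGroupTemplate or
// CompositePodGroupTemplate entry inside the parent Workload.
type Template struct {
	Name string
	Kind string
}

// resolveTemplate finds the template named by the annotation. Because names
// are unique within a Workload, the first match is the only match.
func resolveTemplate(annotations map[string]string, templates []Template) (Template, error) {
	name, ok := annotations[podGroupTemplateAnnotation]
	if !ok || name == "" {
		return Template{}, fmt.Errorf("annotation %q is not set", podGroupTemplateAnnotation)
	}
	for _, t := range templates {
		if t.Name == name {
			return t, nil
		}
	}
	return Template{}, fmt.Errorf("no template named %q in the parent Workload", name)
}

func main() {
	templates := []Template{
		{Name: "replica", Kind: "CompositePodGroupTemplate"},
		{Name: "workers", Kind: "PodGroupTemplate"},
	}
	child := map[string]string{podGroupTemplateAnnotation: "workers"}

	t, err := resolveTemplate(child, templates)
	fmt.Println(t.Kind, err)
}
```

Relying on name uniqueness keeps the value a single opaque string, so no path-like encoding of the position in the tree is needed.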

soltysh · 2026-05-22T12:00:16Z

+- [KEP-4671: Gang Scheduling using Workload Object](https://kep.k8s.io/4671)
+- [KEP-5710: Workload-aware preemption](https://kep.k8s.io/5710)
+- [KEP-5732: Topology-aware workload scheduling](https://kep.k8s.io/5732)
+- [KEP-6017: CompositePodGroup API](https://kep.k8s.io/6017)


Suggested change

- [KEP-6017: CompositePodGroup API](https://kep.k8s.io/6017)

- [KEP-6012: CompositePodGroup API](https://kep.k8s.io/6012)

is the actual tracking KEP :wink:

soltysh · 2026-05-22T12:25:10Z

+}
+
+// WorkloadTopologyConstraint describes a desired topological colocation for all pods in the group.
+type WorkloadTopologyConstraint struct {


I see two potential approaches here:

1. We just re-use the existing structs from scheduling/v1alpha3.

2. We introduce an entirely new API group with all the shareable APIs, and then the library will be responsible for translating that into the core scheduling primitives.

Duplicating everything under scheduling group is not the right approach, just like Wojtek says. Maybe worth double checking with API approvers?

soltysh · 2026-05-22T12:27:44Z

+type JobSchedulingConfiguration struct {
+	// SchedulingPolicy defines the gang or basic scheduling rules for this Job.
+	// +optional
+	SchedulingPolicy *schedulingv1alpha3.WorkloadSchedulingPolicy `json:"schedulingPolicy,omitempty"`


If the library provides validation, I'd assume we automatically inherit all the changes in the core by virtue of using them, no? Whether we should adjust is probably a separate question that we'll likely consider case by case. The part I'm more worried about is: will we remember?

soltysh · 2026-05-22T14:13:47Z

+	ResourceClaimTemplateName *string `json:"resourceClaimTemplateName,omitempty"`
+}
+```
+### Job Integration (batch/v1)


Quick note from my conversation with @mm4tt, this is probably the best place to link to from KEP-5547.

Yeah, @helayoty will do that

soltysh · 2026-05-22T14:23:20Z

+During updates to an active `Job`, the API server validation strategy enforces the following rules:
+* **Immutable Fields (Updates are explicitly rejected):**
+  * Modifying the scheduling mode itself (e.g. changing an active Job from `Basic` to `Gang` scheduling or vice-versa).
+  * Updating `SchedulingConstraints` (topology co-location rules) or `DisruptionMode` after the Job has started.


I agree with Wojtek here, let's start simple with forbidding mutation; it's easier to relax that validation at a later stage than to fix eventual problems.

Signed-off-by: Heba Elayoty <heelayot@microsoft.com>

helayoty · 2026-05-28T16:40:55Z

/area workload-aware

wojtek-t

I have a couple more questions, but overall this looks great now!

wojtek-t · 2026-05-29T13:20:41Z

+owning-sig: sig-scheduling
+participating-sigs:
+  - sig-apps
+status: provisional


implementable

wojtek-t · 2026-05-29T13:24:28Z

+    `replicatedJobs` and their parallelism, whereas a single child `Job` only knows its own pods).
+  * **Ownership & Skip Logic:** Child controllers (like standard `Job`) observe their
+    `OwnerReference` pointing to a registered parent workload and explicitly **bypass** creating
+    any `Workload` objects. This prevents duplicate resource creation and guarantees a single


But they still may need to create a PodGroup. I think it's worth mentioning here.
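To make the distinction concrete, here is a rough sketch of the decision a child controller would make. All names (`OwnerRef`, `Plan`, `planFor`, the `parentDelegatesPodGroups` flag) are hypothetical, and the standalone branch creating both objects is an assumption rather than something the KEP text quoted here states.

```go
package main

import "fmt"

// OwnerRef is a hypothetical, trimmed-down OwnerReference.
type OwnerRef struct {
	Kind       string
	Controller bool
}

// Plan lists what a child controller creates for one object.
type Plan struct {
	CreateWorkload bool
	CreatePodGroup bool
}

// ownedByRegisteredParent reports whether a controlling owner is one of the
// parent workload kinds that generate the Workload themselves.
func ownedByRegisteredParent(owners []OwnerRef, registered map[string]bool) bool {
	for _, o := range owners {
		if o.Controller && registered[o.Kind] {
			return true
		}
	}
	return false
}

// planFor skips Workload creation for owned children, but keeps PodGroup
// creation whenever the parent delegates it to the child controller.
func planFor(owners []OwnerRef, registered map[string]bool, parentDelegatesPodGroups bool) Plan {
	if !ownedByRegisteredParent(owners, registered) {
		// Assumption: a standalone object creates both.
		return Plan{CreateWorkload: true, CreatePodGroup: true}
	}
	return Plan{CreateWorkload: false, CreatePodGroup: parentDelegatesPodGroups}
}

func main() {
	registered := map[string]bool{"JobSet": true}
	owned := []OwnerRef{{Kind: "JobSet", Controller: true}}

	fmt.Printf("%+v\n", planFor(nil, registered, false))
	fmt.Printf("%+v\n", planFor(owned, registered, true))
}
```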

wojtek-t · 2026-05-29T13:25:23Z

+    source of truth.
+* **Separation of Structure and Policy:** The integration strictly separates real-workload
+  structure from scheduling policies:
+  * **The Controller owns the Structure:** The real-workload controller (e.g., `JobSet` or `LWS`)


It's not the controller that owns the structure - it's the API.

So maybe:
"The true workload API owns the structure"
?

wojtek-t · 2026-05-29T13:31:09Z

+
+This level-specific categorization allows independent API evolution.
+
+The only exception to this division is the `TopologyConstraint` struct (reused directly from


It's the only exception now, but may not be the only exception eventually.

I don't know how to phrase that concisely, but basically if something is representing the "real-world" concept that generally is used verbatim by the scheduling stack - we should reuse it.
If it's more of an abstraction introduced by us - we should duplicate, because we may want to use that differently.

wojtek-t · 2026-05-29T13:33:42Z

+// API Group: scheduling.k8s.io/v1alpha3
+
+// WorkloadPodGroupSchedulingConstraints defines leaf-level scheduling constraints, such as topology.
+type WorkloadPodGroupSchedulingConstraints struct {


unconstructive comment:
I'm not a fan of those names, because they are becoming super long:
WorkloadCompositePodGroupSchedulingConstraints
is 46 characters :)

But as I said - it's unconstructive - I don't have a better suggestion :)

wojtek-t · 2026-05-29T13:47:28Z

+
+To resolve this template and hierarchy mapping without structural API schema changes, the root and
+intermediate orchestrators must propagate these linkages downwards using two well-known metadata
+annotations on the child object templates:


What do you mean by "on the child object templates"?

My assumption is that (taking JobSet as an example), JobSet controller will be setting those two annotations on the Job object when creating it.

If that's true - can you clarify in the KEP? If not, can you explain?
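Under that reading, the parent controller would stamp both linkages on the child object itself at creation time. A minimal sketch under that assumption: the first key is the one quoted earlier in this review, the second key is a made-up placeholder because this thread does not name it, and `ChildObject` and `newChild` are hypothetical.

```go
package main

import "fmt"

const (
	// Key quoted in the KEP diff earlier in this review.
	templateAnnotation = "scheduling.k8s.io/pod-group-template"
	// Hypothetical placeholder: this thread does not name the second key.
	parentAnnotation = "example.invalid/parent-composite-pod-group"
)

// ChildObject is a hypothetical stand-in for a child such as a Job.
type ChildObject struct {
	Name        string
	Annotations map[string]string
}

// newChild builds the child with both linkages set at creation time:
// which template to instantiate, and which parent CPG to attach to.
func newChild(name, templateName, parentCPGName string) ChildObject {
	return ChildObject{
		Name: name,
		Annotations: map[string]string{
			templateAnnotation: templateName,
			parentAnnotation:   parentCPGName,
		},
	}
}

func main() {
	job := newChild("train-workers-0", "workers", "train-replica-0")
	fmt.Println(job.Annotations[templateAnnotation])
	fmt.Println(job.Annotations[parentAnnotation])
}
```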

wojtek-t · 2026-05-29T13:50:36Z

+
+#### 2. Library API Definition
+
+```go


Where will the library live?

To facilitate imports from arbitrary ecosystem projects, it definitely can't be k/k - it probably has to be some staging repo.
The question is - do we need a new staging repo for it?

kube-scheduler repo is not a good fit.
The only other alternative that I see is "component-helpers", but I'm not sure it's a good fit.

@liggitt - thoughts?

Initial draft of KEP 6089: Workload Aware Scheduling Controller APIs

e7ee714

k8s-ci-robot added the do-not-merge/work-in-progress label May 18, 2026

k8s-ci-robot requested review from dom4ha and jeremyrickard May 18, 2026 17:06

k8s-ci-robot added the kind/kep and sig/scheduling labels May 18, 2026

github-project-automation Bot added this to SIG Scheduling May 18, 2026

github-project-automation Bot moved this to Needs Triage in SIG Scheduling May 18, 2026

k8s-ci-robot added the size/XL and cncf-cla: yes labels May 18, 2026

wojtek-t self-assigned this May 19, 2026

k8s-ci-robot added the size/XXL label and removed the size/XL label May 19, 2026

mm4tt force-pushed the kep_6089_was_controller_apis branch 4 times, most recently from a6def00 to a4b2b49 Compare May 20, 2026 13:00

KEP 6089: Address technical TODOs, design principles, and alternatives

619ee1e

mm4tt force-pushed the kep_6089_was_controller_apis branch from a4b2b49 to 619ee1e Compare May 21, 2026 12:02

wojtek-t reviewed May 22, 2026


soltysh reviewed May 22, 2026


pacoxu mentioned this pull request May 27, 2026

KEP-6012: Add initial KEP docs for CompositePodGroup API #6017

Open

KEP-6089: Address first round of review comments

195a92d

mm4tt marked this pull request as ready for review May 28, 2026 16:00

k8s-ci-robot removed the do-not-merge/work-in-progress label May 28, 2026

k8s-ci-robot requested a review from palnabarun May 28, 2026 16:00

KEP-6089: Introduce per-level types

4bb5ff3

mm4tt force-pushed the kep_6089_was_controller_apis branch from c0af910 to 4bb5ff3 Compare May 28, 2026 16:11

KEP-6089: add helayoty@ as an author

c436cea

Signed-off-by: Heba Elayoty <heelayot@microsoft.com>

mm4tt mentioned this pull request May 28, 2026

WAS: Controller Integration APIs #6089

Open

4 tasks

k8s-ci-robot added the area/workload-aware label May 28, 2026

github-project-automation Bot added this to Workload-aware & Topology-aware Workstream May 28, 2026

github-project-automation Bot moved this to Backlog in Workload-aware & Topology-aware Workstream May 28, 2026

helayoty moved this from Backlog to Needs Review in Workload-aware & Topology-aware Workstream May 28, 2026

wojtek-t reviewed May 29, 2026



